Computational Life Science Seminar

class: center, middle, title-slide

# Computational Life Science Seminar
### Stefan Schmutz
### 2020-11-16

---

background-image: url(figures/title.png)
background-position: center
background-size: 75%

<style>
p.caption {
  font-size: 0.6em;
  font-style: italic;
}
</style>

???
Hi everyone. Please interrupt me anytime if you'd like to ask something.

Today, I'd like to present this article by Kelleher and colleagues, which was published last year and carries the title "Inferring whole-genome histories in large population datasets".

---
class: center, middle

# What does this **Title** reveal?

???
What does it reveal about the contents of the article?

As it turns out, quite a lot.

Let's have a look at it bit-by-bit.

---

## .bg-washed-red.b--light-red.ba.bw2.br3.ph2.pv1[Inferring] .moon-gray[.bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[whole-genome] .bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[histories] in .bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[large population datasets]]

<br/>

<div class="figure" style="text-align: center">
<img src="figures/inferring_adjusted2.jpg" alt="source: thesaurus.com" width="500px" />
<p class="caption">source: thesaurus.com</p>
</div>

???
*Inferring* in this context means that:

The authors apply a model, so the result might not be 100% accurate; it's an **estimate/hypothesis**.

---

## .moon-gray[.bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[Inferring]] .bg-washed-yellow.b--yellow.ba.bw2.br3.ph2.pv1[whole-genome] .moon-gray[.bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[histories] in .bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[large population datasets]]

<div class="figure" style="text-align: center">
<img src="figures/Karyotype.png" alt="source: National Human Genome Research Institute" width="400px" />
<p class="caption">source: National Human Genome Research Institute</p>
</div>

???
Next, *whole-genome*

Here they work with whole-genome data from humans.  
The human genome (blueprint for an individual) consists of 23 chromosome pairs (it's diploid), shown in this figure, which are made of DNA.

DNA in turn consists of four bases (A, T, G, C) and the order of those bases (also called sequence) can be determined using DNA sequencing methods.

---

## .moon-gray[.bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[Inferring] .bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[whole-genome]] .bg-washed-green.b--dark-green.ba.bw2.br3.ph2.pv1[histories] .moon-gray[in .bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[large population datasets]]

???
*Histories* in terms of DNA evolution.

Over time the DNA sequence can change (evolutionary process).  
Observing these changes of the DNA allows us to reconstruct history (genealogy).

A way to visualize this history is through phylogenetic trees.

<br/>

<div class="figure" style="text-align: center">
<img src="figures/fig_1_b_wo_caption_adapted.png" alt="source: Adapted from Kelleher et al., 2019" width="500px" />
<p class="caption">source: Adapted from Kelleher et al., 2019</p>
</div>

???
Let's look at the following scenario, an example of a phylogenetic tree:  
- we have gathered genetic sequence information of 5 individuals (*a-e*)
- we compare those 5 sequences with each other and list the differences (*orange diamonds*)
- based on those differences we can infer the history and common ancestors (*f-i*)

This model therefore predicts that the most recent common ancestor of those 5 individuals is *i*.  
On the left-hand side, changes at positions 5 and 1 led to individual *h*, and so on.

Most importantly, a tree can describe real biology.  
Nodes are individuals and edges are connections between them.
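
To make the nodes-and-edges picture concrete, here is a minimal Python sketch. The topology is an assumption for illustration and is not read off the figure; only the labels (*a-e* for samples, *f-i* for ancestors) follow the slide. The helper names `ancestors` and `mrca` are hypothetical.

```python
# A tree stored as a child -> parent map (each entry is one edge).
# Assumed topology for illustration only; the root "i" has no parent.
parent = {
    "a": "f", "b": "f",
    "c": "g", "d": "g",
    "f": "h", "e": "h",
    "g": "i", "h": "i",
}


def ancestors(node):
    """Return the path from `node` up to the root, including `node` itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def mrca(u, v):
    """Most recent common ancestor: first node on v's path that is also on u's path."""
    on_u_path = set(ancestors(u))
    for node in ancestors(v):
        if node in on_u_path:
            return node
    raise ValueError("nodes do not share an ancestor")


# Fold the pairwise MRCA over all samples to find their common ancestor.
samples = ["a", "b", "c", "d", "e"]
root = samples[0]
for sample in samples[1:]:
    root = mrca(root, sample)

print(mrca("a", "b"), root)
```

With this assumed topology, the common ancestor of all five samples is the root *i*, which matches what the tree on the slide predicts.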

---

<br/>

???
As mentioned before, DNA can change over time (mutates).

Such mutations, shown here in red, are derived from an ancestral sequence (in grey) or are passed on from either parent.

As seen on the previous slide, we can use these mutations to reconstruct history.

In this example we see a short (diploid) sequence of three individuals, two parents and a child.

If we look at C1, it is likely inherited from P1.  
For C2 it's a bit more complex. In addition to the G6T mutation, which was not observed before, the first half seems to come from P2 while the second half likely comes from M2.

Unfortunately, in addition to mutations this is an example of another process which makes reconstructing history difficult: ...

---

<br/>

<div class="figure" style="text-align: center">
<img src="figures/recombination.png" alt="source: genome.gov" width="600px" />
<p class="caption">source: genome.gov</p>
</div>

???
...Recombination

It turns out that paternal and maternal genetic material can cross over (recombine), as depicted here.  
As this happens over multiple generations, a mosaic is created.

> Chromosomes can be thought of as mosaics made up of material inherited from multiple ancestors.  
> \- Wilder Wohns

Representing history with a single (accurate) tree is therefore not possible anymore.  
The different parts of the genome have different histories.

**Special methods are needed to solve this issue, which is the main thing this article describes**
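
The mosaic idea can be illustrated with a toy Python sketch. The two parental haplotypes, the breakpoint, and the helper names `recombine`, `mutate`, and `label` are all made up for illustration; they are not taken from the figures.

```python
def recombine(first, second, breakpoint):
    """Copy `first` left of `breakpoint` and `second` from `breakpoint` onwards."""
    if len(first) != len(second):
        raise ValueError("haplotypes must be aligned to the same length")
    return first[:breakpoint] + second[breakpoint:]


def mutate(haplotype, position, derived):
    """Replace the base at a 1-based `position` with the `derived` base."""
    index = position - 1
    return haplotype[:index] + derived + haplotype[index + 1:]


def label(child_base, base_a, base_b):
    """Say which parental haplotype a child base matches."""
    if child_base == base_a and child_base == base_b:
        return "."  # uninformative: both parents carry the same base
    if child_base == base_a:
        return "a"
    if child_base == base_b:
        return "b"
    return "*"      # matches neither parent: a new mutation


# Made-up aligned parental haplotypes of length 10.
parent_a = "ACGTTGCAAT"
parent_b = "ACCTTGCGAT"

# One crossover after position 5, then a new G-to-T mutation at position 6.
child = mutate(recombine(parent_a, parent_b, 5), position=6, derived="T")
origin = "".join(label(c, a, b) for c, a, b in zip(child, parent_a, parent_b))

print(child)
print(origin)
```

Only the positions where the parents differ reveal which side of the breakpoint a segment came from. This is why the left and right parts of the child have different histories.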

---

## .moon-gray[.bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[Inferring] .bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[whole-genome] .bg-near-white.b--moon-gray.ba.bw2.br3.ph2.pv1[histories] in] .bg-lightest-blue.b--blue.ba.bw2.br3.ph2.pv1[large population datasets]

<br/>

<div class="figure" style="text-align: center">
<img src="index_files/figure-html/sequencing-costs-1-1.png" alt="source: genome.gov/sequencingcosts" width="720" />
<p class="caption">source: genome.gov/sequencingcosts</p>
</div>

???
The last part of the title: *large population datasets*.

Here you see the trend of cost per human genome sequenced and the time point where a first draft of the human genome was finished.  
The decrease in price since then is enormous.

Note that the y-axis is shown on a log scale!  
The cost per human genome has dropped drastically and is now below $1,000.

What happened?

---

<br/>

<div class="figure" style="text-align: center">
<img src="index_files/figure-html/sequencing-costs-2-1.png" alt="source: genome.gov/sequencingcosts" width="720" />
<p class="caption">source: genome.gov/sequencingcosts</p>
</div>

???

In recent years, new methods were developed to sequence DNA much faster and cheaper.  
This is known as high-throughput or next-generation sequencing.

---

<br/>

???
As a consequence, the size of available data sets has grown rapidly.

Shown here are the sample numbers of the three data sets used as proof of concept in this article.

With larger data sets, more efficient algorithms (with respect to speed and storage space) are needed.

---
class: center
# Current situation

<br/>

.pull-left[ 
Tree
<div class="figure" style="text-align: center">
<img src="figures/morrison_2016_tree.svg" alt="source: Adapted from Morrison, 2016" width="400px" />
<p class="caption">source: Adapted from Morrison, 2016</p>
</div>
]

???
In order to understand their proposed solution, let's look at the situation so far.

- Many methods exist to infer phylogenetic trees from DNA sequences  
- However, they exclude/neglect the possibility of recombination (e.g. by building (gene) trees from short fragments where recombination can be excluded or assumed to be minimal)

--
.pull-right[
Network
<div class="figure" style="text-align: center">
<img src="figures/morrison_2016_network.svg" alt="source: Adapted from Morrison, 2016" width="400px" />
<p class="caption">source: Adapted from Morrison, 2016</p>
</div>
]

???
If, however, we don't want to neglect recombination events, a different data structure is needed.

- A network can represent such histories, e.g. as an ancestral recombination graph (ARG)  
- ARGs are computationally expensive and therefore limited to a few tens of samples  
- ARGs are therefore rarely used in practice

---
class: center, middle

# What's the Authors' proposed **Solution**?

???
There's clearly room for new methods which overcome the current limitations.

---
class: center, middle

<div class="figure" style="text-align: center">
<img src="figures/tskit_logo.svg" alt="source: github.com/tskit-dev" width="400px" />
<p class="caption">source: github.com/tskit-dev</p>
</div>

???
The authors previously developed a new data structure and provided a library of tools to work with it.  
This data structure is called **tree sequence**.

ts stands for tree sequence, but what is that?  
We saw what a tree is, so a tree sequence is multiple trees, but let's look at it in more detail.

>The **tskit** library provides the underlying functionality used to load, examine, and manipulate tree sequences.  
>
>**tskit** has both a Python and C API (Application Programming Interface).

---
## tree sequence

.pull-left[

<div class="figure" style="text-align: center">
<img src="figures/kelleher_2018_fig_3_pt1.png" alt="source: Kelleher et al., 2018" width="400px" />
<p class="caption">source: Kelleher et al., 2018</p>
</div>
]

???
How can we imagine a tree sequence?

It is equivalent to an ancestral recombination graph (ARG).  
But there's one tree for each part of the genome (separated by recombination events), since the parts have different ancestral histories.

For example:  
We have the DNA sequence information of three individuals (0-2)  
3 and 4 are ancestors, red stars are mutation events

On the bottom we see a genome of length 10 where at position 5 recombination happened  
Since there's one recombination breakpoint, we can represent the data with two trees (shown on top)

.pull-right[

<div class="figure" style="text-align: center">
<img src="figures/kelleher_2018_fig_3_pt2.png" alt="source: Kelleher et al., 2018" width="440px" />
<p class="caption">source: Kelleher et al., 2018</p>
</div>
]

???
These trees can also be represented as the following data structure (everything that is needed is listed in those 4 tables).  
One can go from the representation on the left to the one on the right and vice versa.  
It's optimized for storage size by discarding redundant information (succinct).

This data structure allows (population genetic) statistics to be computed efficiently (e.g. nucleotide diversity `$\pi$`).
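
As a rough illustration of the four tables, here is a pure-Python sketch that does not use `tskit`. All table values are assumptions chosen to resemble the slide (three samples 0-2, ancestors 3 and 4, genome length 10, breakpoint at 5); they are not copied from the figure. The decoding is deliberately naive and is not the efficient algorithm that the library implements. Function names are hypothetical.

```python
from itertools import combinations

# Node table rows: (is_sample, time). Times are assumed.
NODES = [(True, 0), (True, 0), (True, 0), (False, 1), (False, 2)]
# Edge table rows: (left, right, parent, child); intervals are half-open.
# Edges 3->1 and 4->3 span both trees, so they are stored only once.
EDGES = [
    (0, 10, 3, 1),
    (0, 5, 3, 0),
    (5, 10, 3, 2),
    (0, 10, 4, 3),
    (0, 5, 4, 2),
    (5, 10, 4, 0),
]
# Site table rows: (position, ancestral_state).
SITES = [(2, "A"), (7, "G")]
# Mutation table rows: (site index, node above which it occurred, derived_state).
MUTATIONS = [(0, 3, "T"), (1, 3, "C")]

SAMPLES = [i for i, (is_sample, _) in enumerate(NODES) if is_sample]


def parent_map_at(position):
    """Child -> parent map of the local tree that covers `position`."""
    return {c: p for (left, right, p, c) in EDGES if left <= position < right}


def haplotypes():
    """Decode each sample's allele per site by walking up the local tree."""
    result = {sample: [] for sample in SAMPLES}
    for site_id, (position, ancestral) in enumerate(SITES):
        parent = parent_map_at(position)
        derived = {n: state for (site, n, state) in MUTATIONS if site == site_id}
        for sample in SAMPLES:
            node, allele = sample, ancestral
            while node is not None:
                if node in derived:  # first mutation met on the way up wins
                    allele = derived[node]
                    break
                node = parent.get(node)
            result[sample].append(allele)
    return result


def mean_pairwise_differences(haps):
    """Average number of differing sites over all pairs of samples."""
    pairs = list(combinations(haps.values(), 2))
    return sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)


haps = haplotypes()
pi = mean_pairwise_differences(haps)
print(parent_map_at(2), parent_map_at(7))
print(haps, round(pi, 3))
```

The value computed here is the mean number of pairwise differences over the whole sequence. Dividing it by the sequence length would give a per-site diversity.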

---
class: center, middle

# `tsinfer`

???
The work described in this article is based on this data structure.  
More specifically, it is about `tsinfer` (which is part of the `tskit` ecosystem).

`tsinfer` is a method which infers a tree sequence from genome variation data without the previously mentioned barriers of ARGs.

---

## Building ancestors and inferring edges

<br/>

<div class="figure" style="text-align: center">
<img src="figures/suppl_fig_17_top_adapted.png" alt="source: Adapted from Kelleher et al., 2019" width="600px" />
<p class="caption">source: Adapted from Kelleher et al., 2019</p>
</div>

???
First, `tsinfer` builds/infers ancestral haplotypes (as shown here).  
These represent the genomes of ancestors.

Second, the haplotypes are arranged and edges (links between nodes) are inferred (trees are built).

The main point: `tsinfer` reconstructs trees (the output is a tree sequence).

---
class: center, middle

# Application example

???
Where/how could ancestry inference be interesting?

Ancestry inference is of fundamental biological interest: how did evolution happen?

---

# UK Biobank population structure

<div class="figure" style="text-align: center">
<img src="figures/fig_5_a_b.png" alt="source: Kelleher et al., 2019" width="900px" />
<p class="caption">source: Kelleher et al., 2019</p>
</div>

???
In this example, the authors added information about the geographic origin of the individuals and plotted a heat map which shows the genealogical nearest neighbors.

As expected, people from certain areas seem to be connected to others from a similar area.

The authors have shown, on this large, real data set, that their method `tsinfer` returns a reasonable result and fulfills the aim of an efficient algorithm.

---

# Limitations

<br/>

### phased and aligned sequences as input

### assumed each mutation has a single origin

???
No independent occurrences of the same mutation are assumed (no homoplasic substitutions); recurrent and back mutations are (currently) not handled well.

### mutation/recombination ratio has to be sufficiently high

???
There need to be more mutations than recombinations, because mutations are the starting point for ancestor inference.  
If there's no mutation, we can't detect recombination events (mutations are the markers for them).

### only ordering of tree nodes (relative age)

???
result is only relative age  
methods for dating genomic variants could be used (see next slide)

---

# Limitations
.center[
## **Cladogram** vs. **Phylogram** vs. **Chronogram**
]

<br/>

<div class="figure" style="text-align: center">
<img src="figures/cladogram_phylogram_chronogram.png" alt="source: Riutort, 2016" width="800px" />
<p class="caption">source: Riutort, 2016</p>
</div>

???
The current output of `tsinfer` is a cladogram, meaning nodes are arranged relative to each other but not according to time or nucleotide differences.

---

# Outlook
### From topologies to branch lengths (`tsdate`)

???
One of the authors (Wilder Wohns) is already working on implementing this.

### Possible application to genomes of other species

.pull-left[

<img src="figures/recombination_should_not_be_an_afterthought.png" width="600px" style="display: block; margin: auto;" />
]

.pull-right[

<img src="figures/coalre.png" width="600px" style="display: block; margin: auto;" />
]

???
The method is not restricted to human genomes: it has been applied to the *P. vivax* Genome Variation Project. Could it also be applied to viral genomes?!  
The need seems to be there, since viruses also recombine/reassort; see these two examples.

---
class: center, middle

## .bg-washed-red.b--light-red.ba.bw2.br3.ph2.pv1[Inferring] .bg-washed-yellow.b--yellow.ba.bw2.br3.ph2.pv1[whole-genome] .bg-washed-green.b--dark-green.ba.bw2.br3.ph2.pv1[histories] in .bg-lightest-blue.b--blue.ba.bw2.br3.ph2.pv1[large population datasets]

???
In this article, the authors present a method for better inference of genetic histories that also considers recombination events.

In addition to good accuracy compared to state-of-the-art tools, `tsinfer` is also highly scalable, fast and space efficient.

They even argue that it is therefore hypothetically possible to infer "the ancestry of everyone", as the title of their preprint modestly put it.

---

---
class: center, middle

# Appendix

---
# (Dis)Advantages of open source code and data

.pull-left[
### &#x1f44d;

.can-edit.key-likes[
- replicability of results  
- 
]
]

.pull-right[
### &#x1f44e;

.can-edit.key-dislikes[
- sensitive data not protected  
- 
]
]

---

## Comparison to state-of-the-art .pull-right[**Storage Space**]

<br/>

<div class="figure" style="text-align: center">
<img src="figures/fig_1_c.png" alt="source: Kelleher et al., 2019" width="800px" />
<p class="caption">source: Kelleher et al., 2019</p>
</div>

???
Space needed to store genomic variation data.

VCF, a popular format, stores the variants in a matrix (see panel a); its space complexity is O(n × m), where n is the number of nodes/samples and m the number of variant sites.

Tree sequence encoding, on the other hand (shown in panel b), is more space efficient; its complexity is O(n + m + r), where n is the number of nodes, m the number of variant sites, and r the number of recombinations (new nodes).

As a result, the output file size remains much lower than for VCF and is similar to that of the Positional Burrows-Wheeler Transform (PBWT), another recently developed data structure.
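
The difference between the two growth rates can be seen with a small back-of-the-envelope sketch in Python. It counts stored entries rather than bytes, and every number in it is made up. In particular, the number of recombinations is held fixed for simplicity, although in real data it would grow with the sample size. The function names are hypothetical.

```python
def matrix_entries(n_samples, n_sites):
    """Entries in a sample-by-site genotype matrix: n * m."""
    return n_samples * n_sites


def tree_sequence_entries(n_nodes, n_sites, n_recombinations):
    """Additive count from the notes: n + m + r."""
    return n_nodes + n_sites + n_recombinations


N_SITES = 1_000_000         # assumed number of variant sites (m)
N_RECOMBINATIONS = 100_000  # assumed and held fixed (r)

for n in (1_000, 10_000, 100_000):
    dense = matrix_entries(n, N_SITES)
    succinct = tree_sequence_entries(n, N_SITES, N_RECOMBINATIONS)
    print(f"n={n:>7,}  matrix={dense:>15,}  tree sequence={succinct:>9,}")
```

The matrix count grows tenfold with every tenfold increase in samples, while the additive count barely moves. This is the intuition behind the file-size curves in the figure.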

---

## Comparison to state-of-the-art .pull-right[**Accuracy** and **Speed**]

<br/>

.pull-left[
<div class="figure" style="text-align: center">
<img src="figures/fig_3_wo_genotyping_error.png" alt="source: Kelleher et al., 2019" width="470px" />
<p class="caption">source: Kelleher et al., 2019</p>
</div>
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="figures/suppl_fig_8.png" alt="source: Kelleher et al., 2019" width="500px" />
<p class="caption">source: Kelleher et al., 2019</p>
</div>
]

???
`tsinfer` shows comparable accuracy (measured with the Kendall-Colijn (KC) metric, where lower values indicate greater accuracy) and comparable speed when compared to state-of-the-art tools.

---
class: center, middle

<div class="figure" style="text-align: center">
<img src="figures/fig_2.png" alt="source: Kelleher et al., 2019" width="700px" />
<p class="caption">source: Kelleher et al., 2019</p>
</div>

---
class: center, middle

<div class="figure" style="text-align: center">
<img src="figures/fig_4.png" alt="source: Kelleher et al., 2019" width="500px" />
<p class="caption">source: Kelleher et al., 2019</p>
</div>

---
class: center, middle

<div class="figure" style="text-align: center">
<img src="figures/suppl_fig_2.png" alt="source: Kelleher et al., 2019" width="700px" />
<p class="caption">source: Kelleher et al., 2019</p>
</div>

---
class: center, middle

<div class="figure" style="text-align: center">
<img src="figures/suppl_fig_19.png" alt="source: Kelleher et al., 2019" width="700px" />
<p class="caption">source: Kelleher et al., 2019</p>
</div>

---
class: center, middle

<div class="figure" style="text-align: center">
<img src="figures/kelleher_2018_fig_4.png" alt="source: Kelleher et al., 2018" width="700px" />
<p class="caption">source: Kelleher et al., 2018</p>
</div>
 
???

Original tree vs. minimal tree (listing only currently alive individuals and their ancestors)