Conceptual overview of tree sequence recording

fwdpy11 0.2.0 added support for tree sequence recording (TSR), which is a method to both speed up forward-time simulations and to track the entire genealogical history of the simulation. The key reference describing TSR is [Kelleher2018], which in turn relies heavily on concepts described in [Kelleher2016].

The goal of TSR is to track a set of tables that contain the trees describing the ancestry of the population.

Trees and tables

To start out, let us consider the following tree:

../_images/tree.png

Fig 1: A tree with seven nodes.

This tree is the “marginal” history of a genomic segment covering the half-open interval \([l, r)\). In other words, all genomic positions in this interval have the same ancestry, and recombination results in different intervals having different trees.

Following [Kelleher2016], we can represent the above tree using two tables:

../_images/tables.png

Fig 2: The node and edges tables corresponding to Fig 1.

We learn two things from Fig 2:

  1. Node tables track the birth times of nodes. Here, we measure time as increasing from past to the present.

  2. Edge tables record the transmissions of genomic intervals from parents to children. The parent/child fields are indexes of the node table.

Edge tables have specific sorting requirements. The sorting is nested:

  1. Decreasing order of parent birth times (as we read the table from top to bottom).

  2. For edges with the same parent, child indexes are sorted in increasing order

  3. Finally, edges are sorted by increasing left position.

Relating an evolving populaton to node and edge tables

Consider the case of a diploid Wright-Fisher population. There are \(N\) individuals, and thus \(2N\) haploid genomes. Define a genome as all of the genomic intervals inherited from a single parent. Positions along the genome have values from the interval \([0,L)\), and we do not care if positions take on continuous or discrete values within that interval.

At any time point, we may describe the population as consisting of \(2N\) nodes with integer labels \([i,i+2N)\). Our diploids are then defined as adjacent tuples of nodes, \(D \in [(i,i+1),(i+2,i+3),\ldots,(i+2N-2,i+2N-1)]\). Node labels thus represent haploid genomes.

To generate the next generation, offspring are generated by recording which intervals they inherit from which parental nodes. These intervals are determined by the usual rules of evolving a Wright-Fisher model with selection and recombination. We record these transmission events as edges, which we may describe as tuples (left, right, parent, child). In words, such a tuple means “Parental node label parent transmitted the genomic interval \([left,right)\) to offspring node label child”.

The task of a forward simulation is to record new nodes and edges as they arise. The simulation ends up recording many transmission events that quickly go extinct. The simplification algorithm described in the 2018 paper mentioned above takes this “messy” node and edge table and returns “simplified” node and edge tables.

We can visualize this process using an example taken from the tskit tutorials, which implements the discrete-time Wright-Fisher model for a diploid population without recombination and without selection:

In [1]: import tskit

In [2]: import numpy as np
In [3]: def wf1(N, ngens):
   ...:     tc = tskit.TableCollection(1.0)
   ...:     for i in range(2 * N):
   ...:         tc.nodes.add_row(time=0, flags=tskit.NODE_IS_SAMPLE)
   ...:     next_offspring_index = len(tc.nodes)
   ...:     first_parental_index = 0
   ...:     for gen in range(1, ngens + 1):
   ...:         assert next_offspring_index == len(tc.nodes)
   ...:         assert first_parental_index == len(tc.nodes) - 2 * N
   ...:         parents = np.random.randint(0, N, 2 * N)
   ...:         for parent1, parent2 in zip(parents[::2], parents[1::2]):
   ...:             mendel = np.random.random_sample(2)
   ...:             g1 = first_parental_index + 2 * parent1 + (mendel[0] < 0.5)
   ...:             g2 = first_parental_index + 2 * parent2 + (mendel[1] < 0.5)
   ...:             tc.nodes.add_row(time=gen, flags=tskit.NODE_IS_SAMPLE)
   ...:             tc.nodes.add_row(time=gen, flags=tskit.NODE_IS_SAMPLE)
   ...:             tc.edges.add_row(left=0.0, right=1.0, parent=g1, child=next_offspring_index)
   ...:             tc.edges.add_row(
   ...:                 left=0.0, right=1.0, parent=g2, child=next_offspring_index + 1
   ...:             )
   ...:             next_offspring_index += 2
   ...:         first_parental_index += 2 * N
   ...:     return tc
   ...: 

Let’s run the simulation for a few generations and look at the resulting tree:

In [4]: np.random.seed(42)

In [5]: tc = wf1(3, 4)

# Before we can get a tree sequence from
# the data, we must change direction of
# time from foward to backwards to satisty
# tskit:
In [6]: t = tc.nodes.time

In [7]: t -= tc.nodes.time.max()

In [8]: t *= -1.0

In [9]: tc.nodes.set_columns(time=t, flags=tc.nodes.flags)

# Sort the tables:
In [10]: tc.sort()

In [11]: ts = tc.tree_sequence()

In [12]: print(ts.first().draw(format="unicode"))
  0     1  2  3           4             5  
 ┏┻━┓   ┃                 ┃            ┏┻━┓
10 11   7                 9            6  8
        ┃         ┏━━━┳━━━┻━┳━━━━━┓       ┃
       13        12  14    16    17      15
                     ┏┻━┓  ┏┻━┓   ┃       ┃
                    20 23 18 21  19      22
                     ┃  ┃  ┃  ┃  ┏┻━┓      
                    27 26 24 29 25 28      

The resulting tree contains information for extinct lineages as well as redundant node information. Note that the three diploids in the last generation are defined by node pairs (24,25), (26,27), and (28,29).

Let’s apply the simplification algorithm that:

In [13]: samples = np.where(tc.nodes.time == 0)[0]

In [14]: node_map = tc.simplify(samples=samples.tolist())

In [15]: ts = tc.tree_sequence()

In [16]: tree = ts.first()

In [17]: imap = {node_map[node]: node for node in range(len(node_map))}

In [18]: nl = {i: "{}->{}".format(imap[i], i) for i in tree.nodes()}

In [19]: print(tree.draw(format="unicode", node_labels=nl))
               9->9                
     ┏━━━━━━━━━━━╋━━━━━━━━━━━┓     
     ┃         14->7       16->8   
     ┃        ┏━━┻━━┓     ┏━━┻━━┓  
   19->6      ┃     ┃     ┃     ┃  
  ┏━━┻━━┓     ┃     ┃     ┃     ┃  
25->1 28->4 26->2 27->3 24->0 29->5

That’s much nicer! The simplified tree shows now the input node ids are remapped to output node ids in such a manner that relative ordering is preserved.

Thus, the most practical view of TSR is this: we speed up the simulations by not simulating neutral mutations. We only have to simulate the selected variants and occasionally simplify our messy trees. The realized speedups are huge, and I refer you to the 2018 paper for the data on that. But our simulations are not only faster. They record much more information. The tables of nodes, edges, etc., record the entire history of the simulation with respect to a set of sample nodes.

Anyone interested in some of the more technical details of implementing TSR can take a look at the tutorials accompanying the 2018 paper.

Sample recording

Todo

discuss the current generation vs historical/ancient/preserved samples.