Calculating frequency spectra from tree sequences

The mutation frequency spectrum is an array-like object representing the number of mutation events occurring at different frequencies. Generating frequency spectra, or fs, from simulations is handled by fwdpy11.TableCollection.fs().

We have to define some conventions:

  • A sample list is a list of node indexes. The node indexes are stored in a numpy.ndarray with numpy.dtype numpy.int32.

  • We support fs calculated from multiple lists of samples.

  • An fs always contains the zero and fixed bins. Thus, for a sample of n nodes, there are n + 1 entries in the fs. The first and last values are for frequencies zero and n, respectively. Singletons start in bin 1, etc..

  • An fs from a single sample list is represented by a numpy.ma.MaskedArray, object with the zero and fixed bins masked.

  • An fs from more than one sample list is stored in a sparse.COO sparse matrix.

For our examples, we will initialize a fwdpy11.DiploidPopulation from a tskit.TreeSequence generated by msprime.simulate().

Note

These examples do not show how to get fs separately for neutral and non-neutral mutations. See fwdpy11.TableCollection.fs() for details.

In [1]: import fwdpy11

In [2]: import msprime

In [3]: import numpy as np

In [4]: rng = fwdpy11.GSLrng(4321678)

In [5]: config = [msprime.PopulationConfiguration(500), msprime.PopulationConfiguration(500)]

In [6]: ts = msprime.simulate(
   ...:     population_configurations=config,
   ...:     Ne=500.0,
   ...:     random_seed=777,
   ...:     migration_matrix=np.array([0, 0.1, 0.1, 0]).reshape(2, 2),
   ...:     recombination_rate=0.25,
   ...: )
   ...: 

In [7]: pop = fwdpy11.DiploidPopulation.create_from_tskit(ts)

In [8]: md = np.array(pop.diploid_metadata, copy=False)

In [9]: np.unique(md["deme"], return_counts=True)
Out[9]: (array([0, 1], dtype=int32), array([250, 250]))

In [10]: nmuts = fwdpy11.infinite_sites(rng, pop, 0.1)

In [11]: nmuts
Out[11]: 2924

The following blocks show several methods for obtaining the fs from lists of nodes. First, let’s get the lists of nodes from the two demes in our population:

In [12]: nodes = np.array(pop.tables.nodes, copy=False)

In [13]: alive_nodes = pop.alive_nodes

In [14]: deme0_nodes = alive_nodes[np.where(nodes["deme"][alive_nodes] == 0)[0]]

In [15]: deme1_nodes = alive_nodes[np.where(nodes["deme"][alive_nodes] == 1)[0]]

Get an fs from nodes found only in deme 0:

In [16]: pop.tables.fs([deme0_nodes[:10]])
Out[16]: 
masked_array(data=[--, 416, 222, 122, 93, 61, 66, 45, 39, 46, --],
             mask=[ True, False, False, False, False, False, False, False,
                   False, False,  True],
       fill_value=999999,
            dtype=int32)

Get a joint fs from nodes from each deme:

In [17]: fs = pop.tables.fs([deme0_nodes[:10], deme1_nodes[50:55]])

In [18]: fs
Out[18]: <COO: shape=(11, 6), dtype=int32, nnz=53, fill_value=0>

Obtain the full numpy.ndarray for the joint fs:

In [19]: fs.todense()
Out[19]: 
array([[  0, 166,  39,   1,   0,   0],
       [255, 123,  32,   6,   0,   0],
       [ 93,  72,  36,  18,   2,   1],
       [ 18,  50,  30,  19,   5,   0],
       [ 12,  23,  32,  23,   3,   0],
       [  4,  19,  18,  16,   3,   1],
       [  4,   7,  23,  17,  12,   3],
       [  0,   0,  13,  19,   4,   9],
       [  0,   4,   5,  12,  13,   5],
       [  0,   1,   1,   7,  26,  11],
       [  0,   0,   1,   7,   8,  14]], dtype=int32)

Warning

The joint fs can take a lot of memory!

We can use standard array operations to get the marginal fs from our joint fs:

In [20]: fs.sum(axis=1).todense()
Out[20]: array([206, 416, 222, 122,  93,  61,  66,  45,  39,  46,  30])

In [21]: fs.sum(axis=0).todense()
Out[21]: array([386, 465, 230, 145,  76,  44])

Note

Be careful when processing sparse matrix objects! Naive application of regular numpy functions can lead to erroneous results. Be sure to check the sparse documentation.

The marginalization can be tedious for many samples, so you can have it happen automatically, in which case a dict is returned, keyed by sample list index:

In [22]: fs = pop.tables.fs([deme0_nodes[:10], deme1_nodes[50:55]], marginalize=True)

In [23]: for key, value in fs.items():
   ....:     print(key)
   ....:     print(value)
   ....:     print(value.data)
   ....: 
0
[-- 416 222 122 93 61 66 45 39 46 --]
[206 416 222 122  93  61  66  45  39  46  30]
1
[-- 465 230 145 76 --]
[386 465 230 145  76  44]

Note

Marginalizing in this way preserves the convention that the 1-d fs objects are instances of numpy.ma.MaskedArray.

To see how the dict keying works, let’s flip the sample lists:

In [24]: fs = pop.tables.fs([deme1_nodes[50:55], deme0_nodes[:10]], marginalize=True)

In [25]: for key, value in fs.items():
   ....:     print(key)
   ....:     print(value)
   ....:     print(value.data)
   ....: 
0
[-- 465 230 145 76 --]
[386 465 230 145  76  44]
1
[-- 416 222 122 93 61 66 45 39 46 --]
[206 416 222 122  93  61  66  45  39  46  30]

If you only want the fs from particular regions of the genome. By default, the fs is the sum across windows:

In [26]: pop.tables.fs([deme0_nodes[:10]], windows=[(0.1, 0.2), (0.8, 0.9)])
Out[26]: 
masked_array(data=[--, 103, 41, 21, 12, 7, 4, 20, 8, 12, --],
             mask=[ True, False, False, False, False, False, False, False,
                   False, False,  True],
       fill_value=999999,
            dtype=int32)

You can get the fs separately by window, too:

In [27]: pop.tables.fs(
   ....:     [deme0_nodes[:10]], windows=[(0.1, 0.2), (0.8, 0.9)], separate_windows=True
   ....: )
   ....: 
Out[27]: 
[masked_array(data=[--, 43, 26, 10, 7, 5, 2, 7, 5, 6, --],
              mask=[ True, False, False, False, False, False, False, False,
                    False, False,  True],
        fill_value=999999,
             dtype=int32),
 masked_array(data=[--, 60, 15, 11, 5, 2, 2, 13, 3, 6, --],
              mask=[ True, False, False, False, False, False, False, False,
                    False, False,  True],
        fill_value=999999,
             dtype=int32)]

You can also get a joint fs marginalized by sample list and separated by window. In this case, the return value is a list containing the dict for each window:

In [28]: pop.tables.fs(
   ....:     [deme0_nodes[:10], deme1_nodes[:20]],
   ....:     windows=[(0.1, 0.2), (0.8, 0.9)],
   ....:     marginalize=True,
   ....:     separate_windows=True,
   ....: )
   ....: 
Out[28]: 
[{0: masked_array(data=[--, 43, 26, 10, 7, 5, 2, 7, 5, 6, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False,  True],
         fill_value=999999),
  1: masked_array(data=[--, 39, 16, 20, 8, 7, 8, 3, 4, 6, 3, 2, 1, 3, 3, 2, 2,
                     5, 1, 1, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False, False, False, False, False, False, False,
                     False, False, False, False,  True],
         fill_value=999999)},
 {0: masked_array(data=[--, 60, 15, 11, 5, 2, 2, 13, 3, 6, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False,  True],
         fill_value=999999),
  1: masked_array(data=[--, 35, 27, 24, 13, 8, 3, 4, 8, 6, 4, 1, 4, 1, 2, 0, 1,
                     6, 3, 2, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False, False, False, False, False, False, False,
                     False, False, False, False,  True],
         fill_value=999999)}]

Simplifying to the samples

Finally, it is sometimes more efficient to simplify the tree sequences with respect to the sample nodes. For example, if there are a vast number of ancient samples and you are processing each time point separately (see fwdpy11.DiploidPopulation.sample_timepoints()), then not simplifying means iterating over trees that are redundant/irrelevant to the history of the current time point. In order to get the fs from a simplified tree sequence, pass simplify=True when calling fwdpy11.TableCollection.fs().