My attempt to replicate the Pairwise Sequentially Markovian Coalescent (PSMC) model - one of the classic population genetics algorithms - in Python, and to trace the evolutionary history of humanity.
The wonders of population genetics uncover the echoes of our ancestors’ tumultuous past, permanently imprinted in our DNA. In this post, I will share my experience replicating one of my all-time favourite papers, “Inference of human population history from individual whole-genome sequences” by Li and Durbin.
I first encountered PSMC about five years ago, when it was used to investigate the decline of a charming bird species - the spoon-billed sandpiper. The idea that one could trace a species’ population history struck me as something absolutely remarkable. I made several attempts to dig into the original paper, but found the method’s description mathematically dense, demanding extensive background knowledge in the field. Searching elsewhere, I didn’t come across an explanation that was both comprehensive and accessible to those outside the population genetics field. Inspired by R. Feynman’s quote, “What I cannot create, I do not understand”, I decided to reimplement the PSMC algorithm in Python from scratch. Here, I present the key concepts behind PSMC, describe the algorithm’s implementation and showcase some exciting applications of the method that shed light on human evolutionary history.
Like everyone else, you have inherited two homologous sets of chromosomes – one from each parent. Your parents, in turn, received their chromosomes from their own parents, and so on. Imagine tracing the journey of one pair of your chromosomes back in time (let’s assume no recombination for now). Eventually, you’d find that both chromosomes in the pair were once a single chromosome belonging to a distant ancestor. Since their separation, they have accumulated \(D\) unique mutations between them before being passed down to you. Knowing the length of the chromosome \(L\) and the mutation rate \(\mu\), and assuming this rate has been constant, we can estimate the time \(t\) when your most recent common ancestor (MRCA) lived:
\[t = \frac{D}{2 \mu L}\]
For example, suppose you have detected \(100\) mutations between the two copies of your chromosome 1 (~\(248\) Mbp). Given a mutation rate of \(2.5 \times 10^{-8}\) mutations per base per generation, the formula dates your MRCA to roughly \(100 / (2 \times 2.5 \times 10^{-8} \times 2.48 \times 10^{8}) \approx 8\) generations ago.
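To make the arithmetic concrete, here is a minimal sketch using the illustrative numbers above (the variable names are mine):

```python
# Toy estimate of the time to the most recent common ancestor (MRCA),
# t = D / (2 * mu * L), with the illustrative numbers from the text.
D = 100         # mutations accumulated between the two homologues
mu = 2.5e-8     # mutation rate, per base per generation
L = 248e6       # chromosome length in base pairs (~chromosome 1)

t = D / (2 * mu * L)
print(f"MRCA lived ~{t:.1f} generations ago")   # ~8 generations for these toy numbers
```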
In reality, this approach wouldn’t work well because of recombination, which occurs at a significant rate between homologous chromosomes during meiosis.
Without recombination, we could only estimate a single MRCA per chromosome, as one of the parental homologues is discarded in each generation. But recombination permits the exchange of genetic material between the chromosome pairs, preserving segments of parental DNA in subsequent generations. This implies that our chromosomes are a mosaic of thousands of fragments, each originating from thousands of ancestors spread across millions of years! A single chromosome holds a rich tapestry of our ancestral history.
The distribution of the segments’ MRCA times (\(t\)) is important. According to coalescent theory, this distribution - also described as the coalescence probability - can be approximately modelled with an exponential distribution, assuming a constant, large effective population size \(N_e\): the probability that a pair of lineages coalesces \(t\) generations ago is roughly \(p(t) = \frac{1}{2 N_e} e^{-t / (2 N_e)}\).
This is an extremely oversimplified model. However, despite its simplicity, it gives a good proxy for what the distribution of \(t\) across the crossover segments might look like under a scenario of constant population size.
The probability of coalescence at a given time \(p(t)\) is directly influenced by historical fluctuations in population size. During periods of population expansion, the likelihood of coalescence events tends to decrease. In contrast, population bottlenecks — periods when the population size is significantly reduced — typically see an increase in the number of coalescence events.
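To build intuition, here is a toy simulation (my own sketch, not part of PSMC) that samples pairwise coalescence times under a constant \(N_e\) and under a hypothetical bottleneck; the bottleneck epoch absorbs a disproportionate share of the coalescence events:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairwise_tmrca(epochs, n_pairs=100_000):
    """Sample pairwise coalescence times (in generations) under a
    piecewise-constant population size.

    epochs: list of (duration_in_generations, N_e), oldest epoch last;
            the final duration may be np.inf.
    """
    times = np.zeros(n_pairs)
    waiting = np.ones(n_pairs, dtype=bool)   # pairs that have not coalesced yet
    epoch_start = 0.0
    for duration, ne in epochs:
        # Within an epoch, the coalescence waiting time is ~ Exponential(mean 2*Ne)
        wait = rng.exponential(scale=2 * ne, size=waiting.sum())
        here = wait < duration
        idx = np.flatnonzero(waiting)
        times[idx[here]] = epoch_start + wait[here]
        waiting[idx[here]] = False
        epoch_start += duration
    return times[~waiting]

# Constant population vs. a bottleneck 1,000-2,000 generations ago (toy numbers)
t_const = sample_pairwise_tmrca([(np.inf, 10_000)])
t_bottleneck = sample_pairwise_tmrca([(1_000, 10_000), (1_000, 1_000), (np.inf, 10_000)])
print(t_const.mean(), t_bottleneck.mean())
```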
By accurately identifying crossover segments and estimating the distribution of coalescence events, one can infer the historical changes in population size \(N(t)\). This is exactly the link that the PSMC method exploits.
In practice, we don’t actually know the historical locations of crossover breakpoints, and estimating them is far from straightforward. One approach is to analyse changes in the frequency of heterozygous states along the genome: if there is an apparent change, it suggests that the neighbouring segments have differing coalescence times \(t\), and hence that a breakpoint lies between them. Another helpful piece of information is the length of a crossover segment: older segments tend to be shorter, largely because they have had more time to undergo crossover events than younger segments. By formally defining these relationships as functions of population parameters, one can effectively circumvent the problem of explicit crossover segment detection.
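As a toy illustration of this heterozygosity signal (my own sketch, not part of the PSMC pipeline, with an arbitrary window size), one can compute the density of heterozygous sites in consecutive windows along a simulated chromosome; the “older” segment in the middle stands out immediately:

```python
import numpy as np

def het_density(is_het, window=10_000):
    """Fraction of heterozygous positions in consecutive non-overlapping windows.

    is_het: boolean array with one element per genomic position.
    """
    n_windows = len(is_het) // window
    trimmed = is_het[: n_windows * window]
    return trimmed.reshape(n_windows, window).mean(axis=1)

# Toy chromosome: an 'old' (high-heterozygosity) segment flanked by 'young' ones
rng = np.random.default_rng(0)
chrom = np.concatenate([
    rng.random(300_000) < 0.0002,   # young segment, few heterozygous sites
    rng.random(200_000) < 0.0015,   # old segment, many heterozygous sites
    rng.random(300_000) < 0.0002,
])
density = het_density(chrom, window=50_000)
print(np.round(density * 1e3, 2))   # heterozygous sites per kb; the middle windows stand out
```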
The authors developed a sequential hidden Markov model (HMM) that runs along the genome, with discretised coalescence times as hidden states and emissions derived from the local presence of heterozygous sites.
Any HMM defined over \(k\) hidden states \(Z\) is parametrised by three key components: the initial state probabilities \(\pi_k\), the transition probabilities \(A_{k,k'}\) between hidden states, and the emission probabilities \(b_k(y)\) of an observation given the hidden state.
In the PSMC model, the hidden states are coalescence times \(t\) discretised into \(k\) intervals. One of the free parameters is the scaled mutation rate \(\theta\); the others are the scaled recombination rate \(\rho\) and the relative population sizes \(\lambda\), one per time interval.
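To make the parametrisation concrete, here is a minimal sketch of how these quantities could be held in code; the container and the exponential spacing of the time grid are my own illustrative choices, not the paper’s exact definitions:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PSMCParams:
    """Illustrative container for the free parameters (names are mine)."""
    theta: float          # scaled mutation rate
    rho: float            # scaled recombination rate
    lam: np.ndarray       # relative population size in each time interval
    t_max: float = 15.0   # upper bound of the discretised time

def time_intervals(n_steps, t_max, beta=3.0):
    """Boundaries 0 = t_0 < ... < t_n = t_max of the hidden-state intervals,
    spaced more densely near the present (one possible choice of spacing)."""
    x = np.linspace(0.0, 1.0, n_steps + 1)
    return t_max * np.expm1(beta * x) / np.expm1(beta)

params = PSMCParams(theta=0.05, rho=0.02, lam=np.ones(40))
boundaries = time_intervals(n_steps=40, t_max=params.t_max)
```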
Although the original paper by Li and Durbin treats the upper bound of the discretised time range, \(T_{max}\), as a fixed input, I allowed it to be inferred as well. Somehow, inferring \(T_{max}\) speeds up the model’s convergence, despite the fact that its value typically does not stray far from its initial setting of \(T_{max} = 15\), the value stated in the manuscript.
The goal is to identify the parameters \(\theta\), \(\rho\) and \(\lambda\) that maximise the likelihood of the observed data \(Y\):
\[\mathop{\mathrm{argmax}}_{\theta,\rho,\lambda} \hspace{0.25em} p(Y | \theta,\rho,\lambda)\]
Maximising the likelihood of an HMM directly is impractical because of the complexity of the marginalised joint distribution, \(\sum_{Z} p(Y,Z \vert \theta,\rho,\lambda)\): the observations are not independent, and the number of terms in the summation grows exponentially as \(K^N\), where \(N\) is the length of the sequence - extremely large for genomic data. However, a more manageable surrogate objective can be formulated as part of the Expectation-Maximisation (EM) algorithm, known in the HMM setting as the Baum–Welch algorithm.
The optimisation alternates between two steps. In the E step, we compute the posteriors of the hidden states \(Z_{n,k}\), denoted \(\gamma_{n,k}\), and the posteriors of the transitions between pairs of hidden states, denoted \(\xi_{n,k,k'}\). These define a surrogate objective function \(Q\):
\[Q = \sum_{k}\gamma_{1,k}\log \pi_k + \sum_{n=2}^{N}\sum_{k}\sum_{k'}\xi_{n,k,k'}\log A_{k,k'} + \sum_{n=1}^{N}\sum_{k}\gamma_{n,k}\log b_k(y_n)\]
In the M step, we maximise \(Q\) with respect to the parameters \(\theta\), \(\rho\) and \(\lambda\), recomputing \(\pi_k\), \(A_{k,k'}\) and \(b_k\).
Repeating these steps guarantees that the likelihood converges to a local maximum. The motivation and derivation of the EM algorithm are well described in Christopher Bishop’s book “Pattern Recognition and Machine Learning”.
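For reference, the E-step posteriors are obtained from the forward and backward variables \(\alpha\) and \(\beta\) discussed below; these are the standard forward-backward identities (written in Bishop’s notation rather than the paper’s):
\[\gamma_{n,k} = \frac{\alpha_{n,k}\,\beta_{n,k}}{p(Y)}, \qquad \xi_{n,k,k'} = \frac{\alpha_{n-1,k}\,A_{k,k'}\,b_{k'}(y_n)\,\beta_{n,k'}}{p(Y)}\]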
A practical implementation of this model consists of two parts. The first part computes the main model variables (\(\pi\), \(A\) and \(b\)) from the parameters \(\theta\), \(\rho\) and \(\lambda\). The rest is standard HMM machinery: the forward variables \(\alpha\), the backward variables \(\beta\), and finally the EM algorithm itself.
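Before wiring these pieces into PSMC, it helps to see them on a generic discrete HMM. The sketch below is a self-contained toy Baum–Welch iteration (scaled forward and backward passes, the posteriors \(\gamma\) and \(\xi\), and direct updates of \(\pi\), \(A\) and \(b\)); PSMC itself does not update \(A\) and \(b\) freely, but re-derives them from \(\theta\), \(\rho\) and \(\lambda\) in the M step:

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Scaled forward-backward pass for a discrete HMM.

    pi  : (K,)   initial state probabilities
    A   : (K, K) transition probabilities, rows summing to 1
    B   : (K, M) emission probabilities, rows summing to 1
    obs : (N,)   observed symbols in {0, ..., M-1}
    """
    obs = np.asarray(obs)
    N, K = len(obs), len(pi)
    alpha, beta = np.zeros((N, K)), np.zeros((N, K))
    c = np.zeros(N)                              # per-position scaling factors

    # Forward pass (alpha), rescaled at every position to avoid underflow
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * B[:, obs[n]]
        c[n] = alpha[n].sum()
        alpha[n] /= c[n]

    # Backward pass (beta), reusing the same scaling factors
    beta[-1] = 1.0
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[:, obs[n + 1]] * beta[n + 1]) / c[n + 1]

    return alpha, beta, c, np.log(c).sum()       # the last value is the log-likelihood

def em_step(pi, A, B, obs):
    """One Baum-Welch (EM) iteration for a generic discrete HMM."""
    obs = np.asarray(obs)
    alpha, beta, c, ll = forward_backward(pi, A, B, obs)

    gamma = alpha * beta                                   # state posteriors
    xi = (alpha[:-1, :, None] * A[None, :, :] *            # transition posteriors
          (B[:, obs[1:]].T * beta[1:])[:, None, :] / c[1:, None, None])

    new_pi = gamma[0] / gamma[0].sum()
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack([gamma[obs == m].sum(axis=0) for m in range(B.shape[1])], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, ll

# Toy usage: the log-likelihood should never decrease between iterations
rng = np.random.default_rng(1)
obs = rng.integers(0, 2, size=200)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
for _ in range(5):
    pi, A, B, ll = em_step(pi, A, B, obs)
    print(round(ll, 3))
```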
When working with code that has two distinct components, it’s more effective to develop and test them separately. I found myself stuck in a never-ending loop of debugging for several days when I tried to implement the algorithm all at once. This changed when I decided to create a basic HMM specifically to test the HMM functionality. Additionally, I designed simple tests to assess the accuracy of the PSMC parameter calculations. Unsurprisingly, there were bugs in both components.
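For the HMM part, one simple sanity check of this kind is to compare the forward-pass likelihood against a brute-force sum over all hidden paths on a tiny example (my own sketch; the brute force is only feasible for very short sequences):

```python
import itertools
import numpy as np

def brute_force_likelihood(pi, A, B, obs):
    """P(Y) by summing over every possible hidden path (exponential in len(obs))."""
    K = len(pi)
    total = 0.0
    for path in itertools.product(range(K), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for n in range(1, len(obs)):
            p *= A[path[n - 1], path[n]] * B[path[n], obs[n]]
        total += p
    return total

def forward_likelihood(pi, A, B, obs):
    """P(Y) via the forward recursion (no scaling; fine for short sequences)."""
    alpha = pi * B[:, obs[0]]
    for n in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[n]]
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = np.array([0, 1, 1, 0, 1])
assert np.isclose(brute_force_likelihood(pi, A, B, obs),
                  forward_likelihood(pi, A, B, obs))
```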
def compute_params(self):
    # Get the model parameters
    t = self.t                         # boundaries of the discretised time intervals
    theta, rho = self.theta, self.rho  # scaled mutation and recombination rates
    n, lam = self.n_steps, self.lam    # number of time intervals and relative sizes λ
    # Initialise arrays
    sigma = np.zeros(n + 1)            # prior probabilities π over the time intervals
    e = np.zeros((3, n + 1))           # emission probabilities b
    p_kl = np.zeros((n + 1, n + 1))    # transition probabilities A between intervals
    # Compute all the variables just as written in the paper
    # (as efficiently as possible); C_pi and C_sigma are
    # normalisation constants from the paper
    return C_pi, C_sigma, p_kl, e, sigma
At first, I tested it on 300 Mb of synthetic data, roughly the size of chromosome 1. The results look similar to those in the paper, with pitfalls such as low accuracy for time intervals more recent than ~10K years or older than ~2M years, and oversmoothing of rapid, step-like population size changes.
Now the fun part – what can real genomes tell us? I ran PSMC on chromosomes 1-5 for six individuals: three of diverse African origins (San, Mbuti, and Maasai), a European (French), an East Asian (Japanese), and an Austronesian (Ami). The genome data are from The International Genome Sample Resource.
It’s evident that all humans shared a similar history until approximately 150K years ago, surviving a significant bottleneck around 1M years ago. This bottleneck appears to be embedded not only in our DNA but also in the fossil record. Climate change may be a factor, among others.
Approximately 150,000 years ago, the histories of the selected individuals’ ancestors diverged. The noteworthy bottleneck experienced by all of the migrating populations (🟦) was surprisingly similar, despite the vast distances between their descendants.
In contrast, the ancestors of the African populations (🟥) had remarkably different histories, with a decline in effective population size concurrent with their relatives’ migration to Eurasia. The Toba volcanic eruption is among the hypotheses proposed to explain this phenomenon.
When correlating inferred population sizes with archaeological data, it’s tempting to speculate about causality. E.g., did the use of 🔥 or increasing 🧠 volume aid the recovery from the bottleneck 1M years ago? However, as with any model, \(N_e\) is to be interpreted with care.
A little story at the end. Why did I spend time on this instead of working on my thesis?
Well, partly because this method is just awesome. Partly because I really wanted to play with HMMs and EM. But also because I made a promise to myself at the beginning of my PhD to try. During my PhD interview with @Nick_Goldman, I shared this as my favourite paper, acknowledging my limited understanding of how the method works. Nick then asked how long it would take me to replicate it if the maths were provided, and I somewhat naively replied, “I guess three weeks?🤔”.
It took me numerous evenings of dedicated effort to finish the replication. As someone who had limited computational experience four years ago, I cannot stress enough how much of an underestimate that “three weeks” was.