Inference of human population history from a single genome

My attempt to replicate the Pairwise Sequentially Markovian Coalescent (PSMC) model - one of the classic population genetics algorithms - in Python, and to trace the evolutionary history of humanity.

Introduction

The wonders of population genetics uncover the echoes of our ancestors' tumultuous past, permanently imprinted in our DNA. In this post, I will share my experience replicating one of my all-time favourite papers, Inference of human population history from individual whole-genome sequences by Heng Li and Richard M. Durbin. This paper introduces a method called Pairwise Sequentially Markovian Coalescent (PSMC) analysis, which infers population size history directly from a single genome, or even just from a part of it. The core algorithm works like a time machine, guided by genomic information. By analysing the DNA sequence, it reveals how the population of our ancestors changed over a vast period of time, from ~10,000 to as far back as ~10,000,000 years ago.

I first encountered PSMC about five years ago, when it was used to investigate the decline of a charming bird species - the spoon-billed sandpiper. The idea that one could trace a species’ population history struck me as something absolutely remarkable. I made several attempts to dig into the original paper, but found the method’s description mathematically dense and requiring extensive knowledge in the field. Searching elsewhere, I didn’t come across an explanation that was both comprehensive and accessible to those outside of the population genetics field. Inspired by R. Feynman’s quote, “What I cannot create, I do not understand”, I decided to reimplement the PSMC algorithm in Python from scratch. Here, I present the key concepts behind PSMC, describe the algorithm’s implementation and showcase some exciting applications of the method that shed light on human evolutionary history.


Basics of coalescent theory

Like everyone else, you have inherited two homologous sets of chromosomes – one from each parent. Your parents, in turn, received their chromosomes from their own parents, and so on. Imagine tracing the journey of one pair of your chromosomes back in time (let’s assume no recombination for now). Eventually, you’d find that both chromosomes in the pair were once a single chromosome belonging to a distant ancestor. Since their separation, they have together accumulated \(D\) unique mutations before being passed down to you. Knowing the length of the chromosome \(L\) and the mutation rate \(\mu\), and assuming this rate has been constant, we can estimate the time \(t\) when your most recent common ancestor (MRCA) lived:

\[t = \frac{D}{2 \mu L}\]

For example, suppose you detect \(100\) mutations on your chromosome 1 (~\(248\) Mbp). Given a mutation rate of \(2.5 \times 10^{-8}\) mutations per base per generation, you can estimate \(t \approx 8\) generations. If we assume an average of \(25\) years per generation, this suggests that the MRCA of your chromosome 1 existed approximately \(200\) years ago.
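The same back-of-the-envelope calculation in Python, with the illustrative numbers above:

mu = 2.5e-8   # mutation rate per base per generation
L = 248e6     # length of chromosome 1 in base pairs
D = 100       # observed differences between the two homologues

t_generations = D / (2 * mu * L)   # ~8 generations
t_years = t_generations * 25       # ~200 years, assuming 25 years per generation
print(f"MRCA lived ~{t_generations:.1f} generations (~{t_years:.0f} years) ago")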

In reality, this approach wouldn’t work well due to the process of recombination, which occurs at a significant rate between homologous pairs. Recombination essentially shuffles the genetic material by breaking chromosomes into fragments and mixing the maternal and paternal homologous sections. This complexity makes it much harder to accurately estimate \(t\) for these segments, as it requires identifying the points of recombination and determining which sections are continuous. However, this same process offers a surprising advantage that enables PSMC analysis.

Without recombination, we could only estimate a single MRCA per chromosome, as one of the parental homologues is discarded in each generation. But recombination permits the exchange of genetic material between the chromosome pairs, preserving segments of parental DNA in subsequent generations. This implies that our chromosomes are a mosaic of thousands of fragments, each originating from thousands of ancestors spread across millions of years! A single chromosome holds a rich tapestry of our ancestral history.

The distribution of the segments’ MRCA times (\(t\)) is important. According to coalescent theory, this distribution — also described as the coalescence probability — can be approximately modelled by an exponential distribution, assuming a constant, large effective population size \(N_e\). (One needs to be careful not to over-interpret \(N_e\), as it is an abstract value influenced by many factors beyond the number of individuals. For example, population structure, migration constraints or variable reproductive success would directly affect it.)

\[p(t) = \frac{1}{2 N_e} e^{-\frac{t}{2 N_e}}\]

This is an extremely oversimplified model. Despite its simplicity, however, it gives a good proxy for what the distribution of \(t\) across the crossover segments might look like under a scenario of constant population size.
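To get a feel for this distribution, the sketch below draws coalescence times from the exponential model above and checks that their mean matches the theoretical expectation of \(2 N_e\) generations (the values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
N_e = 10_000          # assumed constant effective population size
n_segments = 100_000  # number of independent segments to sample

# Coalescence time of two lineages: exponential with rate 1 / (2 * N_e)
t = rng.exponential(scale=2 * N_e, size=n_segments)

print(f"empirical mean TMRCA:  {t.mean():,.0f} generations")
print(f"theoretical mean:      {2 * N_e:,} generations")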

The probability of coalescence at a given time \(p(t)\) is directly influenced by historical fluctuations in population size. During periods of population expansion, the likelihood of coalescence events tends to decrease. In contrast, population bottlenecks — periods when the population size is significantly reduced — typically see an increase in the number of coalescence events.

By accurately identifying crossover segments and estimating the distribution of coalescent events, one can infer the historical changes in population size \(N(t)\). This is exactly the link that the PSMC method exploits.


PSMC model

In practice, we don’t actually know the historical locations of crossover breakpoints, and estimating this information is far from straightforward. One approach is to analyse changes in the frequency of heterozygous states throughout the genome: if there’s an apparent change, particularly assuming that the segments have differing coalescence times \(t\), it suggests the presence of a breakpoint. Another helpful piece of information is the length of a crossover segment. Older segments tend to be shorter, largely because they have had more time to undergo crossover events compared to younger segments. By formally defining these relationships as functions of population parameters, one can effectively circumvent the problem of crossover segment detection.
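As a deliberately naive illustration of the first idea (not what PSMC actually does), one could scan a sequence of heterozygosity indicators in windows and look for shifts in the local rate:

import numpy as np

rng = np.random.default_rng(1)

# Toy sequence of 0/1 heterozygosity indicators with a breakpoint at 50 kb:
# the left segment is "old" (high heterozygosity), the right one "young" (low).
het = np.concatenate([rng.random(50_000) < 0.002,
                      rng.random(50_000) < 0.0002]).astype(int)

window = 10_000
rates = het.reshape(-1, window).mean(axis=1)  # heterozygosity per window
print(rates)  # the drop after the 5th window hints at a breakpoint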

The authors developed a sequential hidden Markov model (HMM) (a sequential HMM is simply an HMM over sequence data; here, the PSMC algorithm models a sequence of heterozygosity states), which innovatively connects the history of population size with the heterozygosity observed in genome sequences. The key accomplishment of the paper lies in the derivation of the mathematics needed to define this HMM, based on the assumption of a Markovian process (the assumption that each hidden state depends solely on its predecessor clearly lacks a biological basis). While this is a strong assumption, as the saying goes, “All models are wrong, but some are useful.” As I will show later, this model has indeed demonstrated its utility in practical applications.

Any HMM defined over \(K\) hidden states \(Z\) is parametrised by three key components (a minimal toy example follows the list):

  1. Emission probability function \(b_k\), which describes the likelihood of observing a particular outcome \(Y\) from each hidden state.
  2. Transition probability \(A_{k,k'}\), representing the probability of transitioning from one hidden state to another.
  3. Prior distribution of states \(\pi_k\), indicating the probability of each hidden state at the start of the process.
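For concreteness, here is a toy parametrisation of a two-state HMM with two observation symbols, just to fix the notation; the numbers are arbitrary and unrelated to PSMC:

import numpy as np

K = 2  # number of hidden states

pi = np.array([0.6, 0.4])    # prior distribution over states, sums to 1
A = np.array([[0.9, 0.1],    # A[k, k']: probability of moving from state k to k'
              [0.2, 0.8]])
b = np.array([[0.95, 0.05],  # b[k, y]: probability of emitting symbol y in state k
              [0.50, 0.50]])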

In the PSMC model, the hidden states are coalescent times \(t\) discretised over \(k\) intervals. One of the free parameters is the scaled mutation rate \(\theta\) (the neutral mutation rate \(\mu\) scaled by \(N_0\) as \(\theta = 4 N_0 \mu\)), which determines the emission probability of observing a heterozygous state, \(b_{mut} = 1 - e^{-\theta t}\). The other two free parameters are the crossover rate \(\rho\) and the scaled population size at a given coalescence time, \(\lambda(t)\). These parameters influence the transition probability between coalescent times \(A_{k,k'}\) — essentially determining whether to continue along the same segment or introduce a breakpoint and change the state — and the initial state distribution \(\pi_k\). The specific formulations for \(A_{k,k'}\) and \(\pi_k\) are quite complex and are thoroughly detailed in the Supplementary Methods section of the paper.
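As a small illustration, the emission probabilities over a grid of discretised times follow directly from this expression (theta and the time values below are arbitrary illustrative numbers):

import numpy as np

theta = 0.05                        # scaled mutation rate (illustrative)
t = np.array([0.1, 0.3, 1.0, 3.0])  # discretised coalescent times (illustrative)

p_het = 1.0 - np.exp(-theta * t)    # probability of emitting a heterozygous bin
p_hom = np.exp(-theta * t)          # probability of emitting a homozygous bin
e = np.vstack([p_hom, p_het])       # emission matrix, rows = {homozygous, heterozygous}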

Notes on the time discretisation

Although the original paper by Li and Durbin specifies that the PSMC model includes only three free parameters, their practical implementation, available on GitHub, has a fourth free parameter \(T_{max}\). This parameter defines the discretisation of the continuous coalescent time into \(n\) segments:

\[t_i = 0.1 e^{\frac{i}{n}\log (1 + 10 T_{max})} - 0.1\]

Somehow, allowing \(T_{max}\) to be inferred speeds up the model’s convergence, despite the fact that its value typically does not stray far from its initial setting \(T_{max} = 15\), as stated in the manuscript.
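For reference, the grid defined by this formula can be computed in a couple of lines (the helper time_grid is mine; \(n = 64\) is just an illustrative number of intervals):

import numpy as np

def time_grid(n, t_max):
    # t_i = 0.1 * exp(i/n * log(1 + 10 * t_max)) - 0.1, for i = 0..n
    i = np.arange(n + 1)
    return 0.1 * np.exp(i / n * np.log(1 + 10 * t_max)) - 0.1

t = time_grid(n=64, t_max=15)
print(t[:4])   # near-zero, finely spaced recent intervals
print(t[-4:])  # coarse ancient intervals, ending at t_max = 15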

The goal is to identify the parameters \(\theta\), \(\rho\) and \(\lambda\) that maximise the likelihood of the observed data \(Y\):

\[\mathop{\mathrm{argmax}}_{\theta,\rho,\lambda} \hspace{0.25em} p(Y | \theta,\rho,\lambda)\]

Maximising the likelihood of an HMM directly is impractical due to the complexity of the joint distribution, \(\sum_{Z} p(Y,Z \vert \theta,\rho,\lambda)\). The observations are not independent, and the number of terms in the summation grows exponentially as \(K^N\), where \(N\) is the length of the sequence, which gets extremely large for genomic data. However, a more manageable surrogate objective can be formulated as part of the Expectation-Maximisation (EM) algorithm, known in the HMM setting as the Baum–Welch algorithm.

The optimisation is done in two steps. In the E step, we compute the posteriors of the hidden states \(Z_{n,k}\), denoted \(\gamma_{n,k}\), and the posteriors of transitions between pairs of hidden states, denoted \(\xi_{n,k,k'}\). These allow us to define a surrogate optimisation function \(Q\):

\[Q = \sum_{k=1}^{K}\gamma_{1,k}\log \pi_k + \sum_{n=2}^{N}\sum_{k=1}^{K}\sum_{k'=1}^{K}\xi_{n,k,k'}\log A_{k,k'} + \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma_{n,k}\log b_k(y_n)\]

In the M step, \(Q\) is maximised with respect to the parameters \(\theta\), \(\rho\) and \(\lambda\), recomputing \(\pi_k\), \(A_{k,k'}\) and \(b_k\).

Repeating these steps ensures that the likelihood converges to a local maximum. The motivation and derivation for the EM algorithm are well described in Christopher Bishop’s book “Pattern Recognition and Machine Learning”.
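To make the E step concrete, here is a minimal, generic scaled forward–backward routine for a discrete HMM that returns \(\gamma\), \(\xi\) and the log-likelihood; it is the textbook Baum–Welch E step (see Bishop), not the PSMC-specific code, and the function name e_step is my own:

import numpy as np

def e_step(y, pi, A, b):
    """Scaled forward-backward pass for a discrete HMM.

    y  : (N,) int array of observed symbols
    pi : (K,) initial state distribution
    A  : (K, K) transition matrix, A[k, k'] = P(z_{n+1}=k' | z_n=k)
    b  : (K, M) emission matrix, b[k, y] = P(y | z=k)
    Returns gamma (N, K), xi (N-1, K, K) and the log-likelihood.
    """
    N, K = len(y), len(pi)
    alpha = np.zeros((N, K))
    beta = np.zeros((N, K))
    c = np.zeros(N)  # scaling factors to avoid numerical underflow

    # Forward pass
    alpha[0] = pi * b[:, y[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * b[:, y[n]]
        c[n] = alpha[n].sum()
        alpha[n] /= c[n]

    # Backward pass
    beta[-1] = 1.0
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (b[:, y[n + 1]] * beta[n + 1]) / c[n + 1]

    # Posteriors over states and transitions
    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (b[:, y[1:]].T * beta[1:])[:, None, :] / c[1:, None, None])

    return gamma, xi, np.log(c).sum()

In PSMC, the corresponding M step is where the paper’s mathematics enters: instead of re-estimating \(\pi_k\), \(A_{k,k'}\) and \(b_k\) freely, they are recomputed from the updated \(\theta\), \(\rho\) and \(\lambda\).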


Python implementation

A practical implementation of this model consists of two parts. The first part computes the main model variables (\(\pi_k\), \(A_{k,k'}\) and \(b_k\)) given the parameters. The rest is standard HMM functionality: computing the instrumental variables \(\alpha\) and \(\beta\), and finally the EM algorithm itself.

Practical note

When working with code that has two distinct components, it’s more effective to develop and test them separately. I found myself stuck in a never-ending loop of debugging for several days when I tried to implement the algorithm all at once. This changed when I decided to create a basic HMM specifically to test the HMM functionality. Additionally, I designed simple tests to assess the accuracy of the PSMC parameter calculations. Unsurprisingly, there were bugs in both components.

def compute_params(self):
  # Get the model parameters
  t = self.t
  theta, rho = self.theta, self.rho
  n, lam = self.n_steps, self.lam

  # Initialise arrays
  sigma = np.zeros(n+1)        # prior probabilities π
  e = np.zeros((3, n+1))       # emission probabilities b
  p_kl = np.zeros((n+1, n+1))  # transition probabilities A

  # Compute all the variables just as written in the paper
  # (as efficiently as possible); the elided block also defines
  # the constants C_pi and C_sigma returned below

  return C_pi, C_sigma, p_kl, e, sigma

Simulated data study

First, I tested the implementation on 300 Mb of synthetic data, roughly the size of chromosome 1. The results look similar to those in the paper, with pitfalls such as low accuracy for time intervals before ~10K years and beyond ~2M years, and oversmoothing of rapid, step-like population size changes.
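One simple way to generate such synthetic data is sketched below, under a constant population size: segment TMRCAs are drawn from the exponential model above, the breakpoint probability grows with segment age, and heterozygous bins are emitted with probability \(1 - e^{-\theta t}\). The function name simulate_bins and all parameter values are illustrative, and this is not necessarily how the data for this test were produced:

import numpy as np

rng = np.random.default_rng(2)

def simulate_bins(n_bins, theta=0.05, rho=0.02, N_e=1.0):
    """Very rough SMC-style simulation of 0/1 heterozygosity bins.

    Segment TMRCAs are exponential with mean 2*N_e (coalescent units);
    segment lengths are geometric with per-bin breakpoint probability ~ rho*t;
    each bin is heterozygous with probability 1 - exp(-theta*t).
    """
    bins = np.zeros(n_bins, dtype=int)
    pos = 0
    while pos < n_bins:
        t = rng.exponential(scale=2 * N_e)            # TMRCA of this segment
        p_break = min(1.0, rho * t)                   # older segments end sooner
        length = rng.geometric(p=max(p_break, 1e-9))  # segment length in bins
        end = min(pos + length, n_bins)
        bins[pos:end] = rng.random(end - pos) < (1 - np.exp(-theta * t))
        pos = end
    return bins

y = simulate_bins(3_000_000)  # ~300 Mb, assuming 100 bp per bin
print(y.mean())               # overall fraction of heterozygous bins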

Real human genomes and our unique history

Now the fun part – what can real genomes tell us? I ran PSMC on chromosomes 1-5 of six individuals: three of diverse African origins (San, Mbuti, and Maasai), a European (French), an East Asian (Japanese), and an Austronesian (Ami). The genome data come from The International Genome Sample Resource.

It’s evident that all humans shared a similar history until approximately 150K years ago, surviving a significant bottleneck around 1M years ago. This bottleneck appears to be recorded not only in our DNA but also in the fossil record. Climate change may be a factor, among others.

Approximately 150,000 years ago, the history of the selected individuals’ ancestors diverged. The noteworthy bottleneck experienced by all migrating populations (🟦) was surprisingly similar, despite the vast distances between their descendants.

In contrast, the ancestors of African populations (🟥) had remarkably different histories, with a decline in effective population size coinciding with their relatives’ migrations to Eurasia. The Toba volcanic eruption is among the hypotheses proposed to explain this phenomenon.

By correlating inferred population sizes with archaeological data, it’s tempting to speculate about causality. For example, did the use of 🔥 or increasing 🧠 volume aid the recovery from the bottleneck 1M years ago? However, as with any model, \(N_e\) is to be interpreted with care.


Afterword

A little story at the end. Why did I spend time on this instead of working on my thesis?

Well, partly because this method is just awesome. Partly because I really wanted to play with HMMs and EM. But also, because I made a promise to myself at the beginning of my PhD to try. During my PhD interview with @Nick_Goldman, I shared this as my favourite paper, acknowledging my limited understanding of how the method works. Nick then asked me how long it’d take to replicate if the math is provided, and I somewhat naively replied, “I guess three weeks?🤔”.

It took me numerous evenings of dedicated effort to complete the replication. As someone who had limited computational experience four years ago, I cannot stress enough how much of an underestimation that “three weeks” was.