PhaseTypeR: an R package for phase-type distributions in population genetics

Phase-type distributions have traditionally been used in actuarial sciences and queuing theory, and, more recently, in population genetics. In order to facilitate the use of phase-type theory in population genetics, we present PhaseTypeR, a general-purpose and user-friendly R (R Core Team, 2021) package which contains all key functions —mean, (co)variance, probability density function, cumulative distribution function, quantile function and random sampling— for both continuous and discrete phase-type distributions. Importantly, univariate and multivariate reward transformations are implemented for continuous and discrete phase-type distributions. Multivariate reward transformations have great potential for applications in population genetics, and we have included two examples. The first is concerned with the easy calculation of the variance-covariance matrix for the site frequency spectrum (SFS) of the n-coalescent, and the second is concerned with the correlation between tree heights in the two-locus ancestral recombination graph.


Summary
Phase-type distributions describe the time until absorption of a continuous or discrete-time Markov chain (Bladt & Nielsen, 2017). The probabilistic properties of phase-type distributions (i.e., the probability density function, cumulative distribution function, quantile function, moments and generating functions) are well-described and analytically tractable using matrix manipulations.
Phase-type distributions have traditionally been used in actuarial sciences and queuing theory, and, more recently, in population genetics. In order to facilitate the use of phase-type theory in population genetics, we present PhaseTypeR, a general-purpose and user-friendly R (R Core Team, 2021) package which contains all key functions -mean, (co)variance, probability density function, cumulative distribution function, quantile function and random sampling-for both continuous and discrete phase-type distributions. Importantly, univariate and multivariate reward transformations are implemented for continuous and discrete phase-type distributions. Multivariate reward transformations have great potential for applications in population genetics, and we have included two examples. The first is concerned with the easy calculation of the variance-covariance matrix for the site frequency spectrum (SFS) of the -coalescent, and the second is concerned with the correlation between tree heights in the two-locus ancestral recombination graph.

Statement of need
In recent years, the usefulness of phase-type theory in population genetics has become evident. Important quantities in population genetics such as the time until the most recent ancestor, the total tree length, the total number of segregating sites, and the site frequency spectrum follow phase-type distributions (Hobolth et al., 2019). There are already several other R packages that model phase-type distributions, such as actuar (Dutang et al., 2008), mapfit (Hiroyuki Okamura, 2015;Hiroyuki Okamura & Dohi, 2015;H. Okamura & Dohi, 2016) or matrixdist (Albrecher et al., 2022;Albrecher & Bladt, 2019). However, these packages only model univariate continuous phase-type distributions, do not include reward transformations, and are tailored to actuarial sciences and queuing theory.
To overcome these limitations, we present PhaseTypeR, an R package with easy-to-use, generalpurpose phase-type functions, which is particularly well suited for population genetics. The package has already been used in Hobolth et al. (2021) to model the site frequency spectrum using multivariate phase-type theory, and we believe that its intuitive implementation will encourage more population geneticists to use phase-type theory.

Overview
PhaseTypeR contains general implementations of core functions for continuous and discrete phase-type distributions, for both the univariate and multivariate cases. These include the mean and (co)variance of phase-type distributions; their density, probability and quantile functions; functions for random draws; and functions for reward transformations. Table 1 provides an overview of the PhaseTypeR functions for a univariate continuous phase-type distribution ∼ PH( , ), where is the initial distribution and the sub-intensity matrix. Let { ( ) ∶ ≥ 0} denote the corresponding continuous-time Markov chain (CTMC). The reward transformation is then given by = ∫ 0 ( ( )) , where is the time to absorption, and is also phase-type distributed (Bladt & Nielsen, 2017). If the CTMC has transient states, then the reward function ( ), = 1, … , , is a vector of length . Table 1: formulas and corresponding PhaseTypeR functions for univariate continuous phase-type distributions. The vector determines the initial probabilities, is the sub-intensity matrix, is a vector with 1 in every entry, and r is the reward vector.

Quantity
Formula Function  Table 2 provides an overview of the PhaseTypeR functions for the univariate discrete phase-type distribution. Here, the reward transformation is given by Campillo Navarro (2018). Table 3 gives an overview of the multivariate phase-type distribution. A multivariate phase-type distribution is the joint distribution of ( 1 , … , ) where = ∫ 0 ( ( )) for = 1, … , . We summarize the rewards ( ) in a matrix with rows (corresponding to the transient states) and columns (corresponding to the reward functions) with entries = ( ). The vector determines the initial probabilities, is the sub-transition matrix, is a vector with one in every entry, is the identity matrix, and r is the reward vector.   (2008), which can be used to derive the variance of the elements of the site frequency spectrum (SFS) and the covariance between pairs of elements of the SFS.
Let be the 'th element of the site frequency spectrum, i.e., 1 is the number of singletons, 2 is the number of doubletons, etc. Let's also define , which is the total branch length leading to . Theorem 3.1 in Hobolth et al. (2019) states that the vector = ( 1 , … , −1 ) has a multivariate phase-type distribution where and are respectively the state-space and the sub-transition matrix of the so-called "block-counting process", and 1 = (1, 0, … , 0). Conditionally on , the 's are independent and Poisson distributed, | ∼ Poisson ( 2 ), where is the underlying mutation rate (Wakeley, 2009). By the law of total variance, we have where diag(⋅) is the diagonal matrix whose entries are given by E[ ]. It is well-known that E[ ] = 1/ , but the expressions for the entries of Σ are fairly complicated (Durrett, 2008;Fu, 1995). However as seen below, numeric calculation of Σ is straightforward using phase-type theory, since we just need to specify the subintensity matrix and the reward matrix for the block-counting process.
Accompanying our package are a number of vignettes, which illustrate the use of phase-type distribution in population genetics. As part of one of these vignettes, we include a function that calculates and for the block-counting process with sample size , which is shown below:  We can now define a multivariate phase-type distribution such that ∼ MPH( , , ). This is straightforward to build in PhaseTypeR with the MPH() function. For = 8: This yields the same variance-covariance matrix as in Theorem 2.2 in Durrett (2008) without the need for analytical derivations.

Example 2: the coalescent with recombination
The traditional procedure for deriving the correlation between the branch lengths in two loci for a sample of size two is by a first-step analysis (e.g., section 7 in Wakeley, 2009). In this section, we exemplify the easy use of PhaseTypeR to obtain the same result.
The state space and transition rates for the two-locus ancestral recombination graph is shown in Figure 1. The filled circles represent material ancestral to the sample, and the crosses indicate that the most recent common ancestor has been found. The lines between the circles or crosses indicate if the ancestral material is present on the same chromosome. The starting state is state 1 at present day with two samples from the same chromosome.
The time when both loci have found their common ancestor is PH( 1 , ) distributed with 1 = (1, 0, … , 0) and where is the recombination rate.
The tree height left in the left locus is the first time the ancestral process { ( ) ∶ ≥ 0} enters state 4 or state 6 or, equivalently, the time spent in state 1, 2, 3 and 5 before absorption in state 6. We therefore have with the reward vector left = (1, 1, 1, 0, 1). Similarly, the tree height right in the right locus is the first time the ancestral process enters state 5 or state 6 or, equivalently, the time spent in state 1, 2, 3 and 4 before absorption in state 6. We therefore have with the reward vector right = (1, 1, 1, 1, 0). A classical result in population genetics gives the covariance between the two tree heights Cov( left , right ) = + 18 2 + 13 + 18 , and we note that for large recombination rates Cov( left , right ) is close to zero, and for small recombination rates it is close to one. Also note that left and right are both exponentially distributed with a rate of 1, so Var( left ) = Var( right ) = 1, and, consequently, Cor( left , right ) = Cov( left , right ) (see also equation 3.10 in Wakeley, 2009). Moreover, as shown by a simple proof in Wilton et al. (2015), we have that ( left = right ) = Cov( left , right ). An implementation using PhaseTypeR simply consists of specifying the initial distribution, rate matrix for the ancestral process, rewards for the two tree heights, and calling the variance function var() for the multivariate phase-type distribution.
From this multivariate phase-type representation of the ancestral recombination graph (ARG), we can simulate, for example, 1,000 samples from the joint distribution of ( left , right ) using rMPH(1000, T_joint) in PhaseTypeR. If the recombination rate is set to a small value, then most of the samples will result in left = right , and the joint density will concentrate along the diagonal, as shown in Figure 2, left (Simonsen & Churchill, 1997). If instead is large, then most of the samples will result in left ≠ right (Figure 2, right).

Conclusion
We have described PhaseTypeR, an easy-to-use package for the analysis of time-homogeneous evolutionary models in population genetics. We have included two examples: (1) the mean and variance for the SFS of the -coalescent with mutation, and (2) the correlation for the tree height in the two-locus coalescent with recombination. The multiple merger coalescent (Birkner & Blath, 2021), the two-island model (Legried & Terhorst, 2022) and the seed bank coalescent (Casanova et al., 2022) constitute other coalescent models where phase-type analyses have been useful. We hope that population geneticists will take advantage of PhaseTypeR in future analyses of coalescent models.