SeleDiff: A fast and scalable tool for estimating and testing selection differences between populations

1 Chinese Academy of Sciences Key Laboratory of Computational Biology, Chinese Academy of Sciences-Max Planck Society Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Shanghai, 200031, China
2 Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
3 State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, 200433, China
4 Institutes of Biomedical Sciences, Shanghai Medical College, Fudan University, Shanghai, 200032, China

DOI: 10.21105/joss.01545


Introduction
Analyzing natural selection is critically important in population genetics (Haldane, 1990). In the past 20 years, researchers have learned a great deal about selection signals in genomic data (Vitti, Grossman, & Sabeti, 2013), but a deeper understanding of selection strength has remained elusive (Thurman & Barrett, 2016). This is largely due to difficulties in estimating selective pressures from empirical data. In addition, as the amount of genomic data has increased dramatically, researchers require more efficient software for analyzing large-scale genomic datasets. To meet these computational demands, we introduce and evaluate SeleDiff, a fast and scalable tool for quantifying differences in selective pressures between populations.

Results
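SeleDiff implements a probabilistic method from our previous study (He et al., 2015). In this approach, we introduced logarithm odds ratios of allele frequencies to measure differences in selective pressures. For a bi-allelic locus in population $i$, let $p_i(t)$ and $q_i(t)$ denote the derived and ancestral allele frequencies at time $t$. We define the absolute fitness of the derived and ancestral alleles as $w_D$ and $w_A$. The relative fitness becomes

$$\frac{w_D}{w_A} = e^{s},$$

where $s$ is the (genic) selection coefficient. The selection (coefficient) difference between populations $i$ and $j$ is

$$d_{ij} = s_i - s_j = \frac{1}{t}\left[\ln \mathrm{OR}_{ij}(t) - \Omega\right], \qquad \mathrm{OR}_{ij}(t) = \frac{p_i(t)/q_i(t)}{p_j(t)/q_j(t)},$$

where OR stands for odds ratio; $\Omega$ approximately follows a normal distribution with a mean of zero and reflects the uncertainty of allele frequencies caused by factors other than selection; $t$ is the divergence time from populations $i$ and $j$ to their most recent common ancestor. Thus, the expectation and variance of the estimate $\hat{d}_{ij}$ are

$$E[\hat{d}_{ij}] = \frac{1}{t}\ln \widehat{\mathrm{OR}}_{ij}, \qquad \mathrm{var}[\hat{d}_{ij}] = \frac{1}{t^{2}}\left\{\mathrm{var}[\ln \widehat{\mathrm{OR}}_{ij}] + \mathrm{var}(\Omega)\right\},$$

where $\mathrm{var}[\ln \widehat{\mathrm{OR}}_{ij}]$ is the sampling variance of the log odds ratio.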
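The log-odds-ratio idea above can be illustrated with a minimal Python sketch. It assumes that relative fitness takes the form $w_D/w_A = e^{s}$, so that $\ln(p/q)$ increases by $s$ every generation, and it ignores drift and sampling noise (that is, it sets Ω to its mean of zero). The function names are hypothetical and are not part of SeleDiff.

```python
import math


def deterministic_frequency(p0, s, t):
    """Derived-allele frequency after t generations under w_D/w_A = exp(s).

    Assumption: no drift, no migration; ln(p/q) grows linearly by s per generation.
    """
    logit = math.log(p0 / (1.0 - p0)) + s * t
    return 1.0 / (1.0 + math.exp(-logit))


def log_odds_ratio(p_i, p_j):
    """ln OR of the derived-allele frequencies in populations i and j."""
    if not (0.0 < p_i < 1.0 and 0.0 < p_j < 1.0):
        raise ValueError("frequencies must lie strictly between 0 and 1")
    return math.log(p_i / (1.0 - p_i)) - math.log(p_j / (1.0 - p_j))


def selection_difference(p_i, p_j, t):
    """Point estimate of s_i - s_j per generation, taking E[Omega] = 0."""
    if t <= 0:
        raise ValueError("divergence time must be positive")
    return log_odds_ratio(p_i, p_j) / t


# Two populations start from the same frequency and diverge for 1000 generations.
p_i = deterministic_frequency(0.1, 0.002, 1000)
p_j = deterministic_frequency(0.1, 0.001, 1000)
estimate = selection_difference(p_i, p_j, 1000)
```

In this idealized setting the estimate recovers the true difference of 0.001 per generation up to floating-point error; with real data, drift and sampling add the uncertainty that Ω and the variance of the log odds ratio are meant to capture.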
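Given a dataset with $n$ loci, we can estimate $\mathrm{var}(\Omega)$ as

$$\widehat{\mathrm{var}}(\Omega) = \operatorname*{median}_{k=1,\dots,n}\left\{\frac{[\ln \widehat{\mathrm{OR}}_{k}]^{2}}{0.455} - \mathrm{var}[\ln \widehat{\mathrm{OR}}_{k}]\right\},$$

where 0.455 is approximately the median of a $\chi^2$-distribution with one degree of freedom and

$$\mathrm{var}[\ln \widehat{\mathrm{OR}}] = \frac{1}{N_i \hat{p}_i} + \frac{1}{N_i \hat{q}_i} + \frac{1}{N_j \hat{p}_j} + \frac{1}{N_j \hat{q}_j}.$$

Here, $N_i$ and $N_j$ are the sample sizes of populations $i$ and $j$. We add 0.5 to allele counts less than 5 for continuity correction. To test the selection difference at a locus, we proposed the statistic

$$\delta = \frac{[\ln \widehat{\mathrm{OR}}]^{2}}{\mathrm{var}[\ln \widehat{\mathrm{OR}}] + \widehat{\mathrm{var}}(\Omega)},$$

where $\delta$ follows a central $\chi^2$-distribution with one degree of freedom in the absence of selection differences.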
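The estimation and testing steps can be sketched in Python as follows. This is an illustration under stated assumptions, not SeleDiff's actual code: it assumes the variance of the log odds ratio is the sum of reciprocal allele counts, that var(Ω) is taken as the median over loci of $(\ln \mathrm{OR})^2/0.455$ minus that variance (floored at zero here, which is an added safeguard), and that the test statistic is the squared log odds ratio divided by the total variance. All function names are hypothetical.

```python
import math
from statistics import median

CHI2_1_MEDIAN = 0.455  # approximate median of a chi-square distribution with 1 d.f.


def corrected(count):
    """Add 0.5 to allele counts below 5 (continuity correction from the text)."""
    return count + 0.5 if count < 5 else float(count)


def log_or_and_variance(d_i, a_i, d_j, a_j):
    """ln OR and its large-sample variance from derived/ancestral allele counts.

    Assumption: the variance is the sum of reciprocals of the four counts.
    """
    d_i, a_i, d_j, a_j = (corrected(c) for c in (d_i, a_i, d_j, a_j))
    ln_or = math.log((d_i * a_j) / (a_i * d_j))
    var_ln_or = 1.0 / d_i + 1.0 / a_i + 1.0 / d_j + 1.0 / a_j
    return ln_or, var_ln_or


def estimate_var_omega(loci):
    """Median-based estimate of var(Omega) from a list of count tuples (assumed form)."""
    values = []
    for counts in loci:
        ln_or, var_ln_or = log_or_and_variance(*counts)
        values.append(ln_or ** 2 / CHI2_1_MEDIAN - var_ln_or)
    return max(0.0, median(values))  # flooring at zero is an added safeguard


def delta_statistic(counts, var_omega):
    """Test statistic and chi-square (1 d.f.) p-value for one locus."""
    ln_or, var_ln_or = log_or_and_variance(*counts)
    delta = ln_or ** 2 / (var_ln_or + var_omega)
    # For 1 d.f., P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(delta / 2.0))
    return delta, p_value


# Hypothetical counts: (derived_i, ancestral_i, derived_j, ancestral_j) per locus.
neutral_loci = [(50, 50, 50, 50)] * 9
candidate = (80, 20, 20, 80)
var_omega = estimate_var_omega(neutral_loci + [candidate])
delta, p_value = delta_statistic(candidate, var_omega)
```

Using the median rather than the mean keeps the estimate of var(Ω) insensitive to the small fraction of loci that are themselves under differential selection.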
We evaluated SeleDiff in different demographic models (Figure 1) simulated by SLiM 2 (Haller & Messer, 2017). In Models 1-70, we assume larger selection coefficients in Population 1 than in Population 2 (Figure 1A-E). Without migration, SeleDiff accurately estimates selection differences ranging from 0 to 0.002/generation in scenarios with different population sizes (Figure 2A, Models 1-9). The estimated differences (Figure 2A, Models 10-17) are slightly smaller in scenarios with low initial frequencies (≤ 0.02) of the selected allele or long divergence times (≥ 5000 generations), because alleles with low initial frequencies are easily lost regardless of their selection coefficients, and alleles with small selection coefficients can reach high frequencies given enough time. SeleDiff is affected little by time-varying population sizes (Figure 2A, Models 18-37), except for extremely severe bottlenecks in populations under weaker selective pressure (Figure 2A, Model 23). In Models 38-46 (Figure 2A), populations diverge into subpopulations, and selection stops in one of these subpopulations. If we ignore this structure, the estimated differences diminish because SeleDiff treats all the individuals in a group homogeneously. Therefore, it is important to select samples carefully and interpret results cautiously.

In models with moderate migration rates (0.00001-0.0001/generation), the estimated differences are only slightly smaller than the given values, whereas strong migration reduces differences between populations (Figure 2B, Models 47-70), a well-known phenomenon in population genetics (Crow & Kimura, 2009). SeleDiff also works well in complex models (Figure 2B, Models 1a-6d) involving multiple demographic events from human evolution (Gravel et al., 2011). Thus, SeleDiff is robust in various demographic models, and its estimates indicate the lower bounds of differences in selective pressures when migration or substructure exists.
Finally, we compared the performance of SeleDiff with other cross-population methods in two recent programs for genome-wide selection scans, 4P and selscan (Benazzo, Panziera, & Bertorelle, 2014; Szpiech & Hernandez, 2014). All the programs were executed with a single thread. SeleDiff can analyze a dataset containing $10^8$ base pairs of variants in less than 1 hour (Figure 2C) with less than 4 gigabytes of random-access memory (Figure 3), and is much faster than the other two programs (Figure 2D). To enhance the scalability of SeleDiff, we integrated it with a newly developed online algorithm, t-digest (Dunning & Friedman, 2014). T-digest allows SeleDiff to estimate var(Ω) from genome-wide data with only a small amount of memory (Figure 3). In summary, SeleDiff can help researchers detect and quantify natural selection from massive genomes in this era of big data.
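To show why an online quantile algorithm keeps memory small, the sketch below estimates a median in one pass with a fixed number of counters. It deliberately uses a simple fixed-range histogram rather than t-digest itself, which uses adaptive centroids and needs no range known in advance; the class name, value range, and bin count are illustrative assumptions.

```python
class HistogramMedian:
    """One-pass, fixed-memory median estimate over a known value range.

    This is a simplified stand-in for an online quantile sketch, not t-digest.
    """

    def __init__(self, low, high, bins=1000):
        if high <= low or bins < 1:
            raise ValueError("need high > low and at least one bin")
        self.low = low
        self.high = high
        self.width = (high - low) / bins
        self.counts = [0] * bins
        self.total = 0

    def add(self, x):
        # Values outside [low, high) are clamped into the first or last bin.
        idx = int((x - self.low) / self.width)
        idx = min(max(idx, 0), len(self.counts) - 1)
        self.counts[idx] += 1
        self.total += 1

    def median(self):
        if self.total == 0:
            raise ValueError("no data")
        target = self.total / 2.0
        running = 0
        for idx, count in enumerate(self.counts):
            if count and running + count >= target:
                # Interpolate linearly inside the bin that contains the median.
                frac = (target - running) / count
                return self.low + (idx + frac) * self.width
            running += count
        return self.high


# Stream 1000 values without storing them; memory stays at 100 counters.
sketch = HistogramMedian(low=0.0, high=1000.0, bins=100)
for value in reversed(range(1000)):
    sketch.add(value)
approx_median = sketch.median()
```

The error of this stand-in is bounded by the bin width (10 in the example), whereas its memory use does not grow with the number of loci, which is the property that matters when the median is taken over genome-wide variants.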

Figure 1: The demographic models in simulation.