Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification

DNA shotgun sequencing of human, animal, and environmental samples has opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics (Hugenholtz & Tyson, 2008). One aspect of metagenomics is investigating the community composition of organisms within a sequencing sample with tools known as taxonomic classifiers, such as Kraken (Wood & Salzberg, 2014).


Method
Starting from a numerical organism count matrix (samples as columns, organisms as rows, obtained with a taxonomic classifier) of the merged reference and sink datasets, samples are first normalized relative to each other to correct for uneven sequencing depth, using the geometric mean of pairwise ratios (GMPR) method by default (L. Chen et al., 2018).
After normalization, Sourcepredict performs a two-step prediction algorithm. First, it predicts the proportion of unknown sources, i.e., sources that are not represented in the reference dataset. Second, it predicts the proportion of each known source of the reference dataset in the sink samples.
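As an illustration of the normalization step, the following Python sketch computes GMPR-style size factors on a toy count table. It is not Sourcepredict's own code: the function names `gmpr_size_factors` and `gmpr_normalize` are hypothetical, samples are stored as rows rather than columns for brevity, the self ratio is assumed to be part of the geometric mean, and no threshold on the number of organisms shared by a pair of samples is applied.

```python
import math
from statistics import median


def gmpr_size_factors(samples):
    """Return one GMPR-style size factor per sample.

    `samples` is a list of count vectors, one per sample, all listing the
    organisms in the same order (hypothetical layout for this sketch).
    """
    factors = []
    for x in samples:
        log_medians = []
        for y in samples:
            # Pairwise ratios use only organisms observed in both samples.
            ratios = [a / b for a, b in zip(x, y) if a > 0 and b > 0]
            if ratios:
                log_medians.append(math.log(median(ratios)))
        if not log_medians:
            raise ValueError("sample has no non-zero count")
        # Geometric mean of the median ratios (assumption: self ratio included).
        factors.append(math.exp(sum(log_medians) / len(log_medians)))
    return factors


def gmpr_normalize(samples):
    """Divide every count by the size factor of its sample."""
    factors = gmpr_size_factors(samples)
    return [[count / f for count in x] for x, f in zip(samples, factors)]


# Toy data: the same community sequenced at depths 1x, 2x and 4x.
toy = [[10, 20, 30], [20, 40, 60], [40, 80, 120]]
normalized = gmpr_normalize(toy)
```

On this toy table the three samples differ only by sequencing depth, so they become identical after normalization, which is the behavior the correction is meant to provide.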

Prediction of the proportion of unknown sources
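Let $S_i \in \{S_1, .., S_n\}$ be a sample from the normalized sinks dataset $D_{sink}$, $o_j^i \in \{o_1^i, .., o_{n_o^i}^i\}$ an organism in $S_i$, and $n_o^i$ the total number of organisms in $S_i$, with $o_j^i \in \mathbb{Z}^+$. Let $m$ be the mean number of samples per source in the reference dataset, i.e., the number of reference samples divided by the number of sources. For each sample $S_i$, $||m||$ estimated samples $U_k^{S_i} \in \{U_1^{S_i}, .., U_{||m||}^{S_i}\}$ are defined to add to the reference dataset to account for the unknown source proportion in a test sample.

Separately for each $S_i$, a proportion denoted $\alpha \in [0, 1]$ of each organism $o_j^i$ of $S_i$ is added to each $U_k^{S_i}$ sample, such that $U_k^{S_i}(o_j^i) = \alpha \cdot x_{ij}$, where $x_{ij}$ is sampled from a Gaussian distribution $\mathcal{N}(S_i(o_j^i), 0.01)$. The $||m||$ $U_k^{S_i}$ samples are then added to the reference dataset $D_{ref}$ and labeled as unknown, to create a new reference dataset denoted $^{unk}D_{ref}$.

To predict the proportion of unknown sources, a Bray-Curtis (Bray & Curtis, 1957) pairwise dissimilarity matrix of all $S_i$ and $U_k^{S_i}$ samples is computed using scikit-bio (Rideout et al., 2018). This distance matrix is then embedded in two dimensions (default) with the scikit-bio implementation of PCoA. The sample embedding is divided into three subsets: $^{unk}D_{train}$ (64%), $^{unk}D_{test}$ (20%), and $^{unk}D_{validation}$ (16%). The scikit-learn (Pedregosa et al., 2011) implementation of the KNN algorithm is then trained on $^{unk}D_{train}$, and its accuracy is computed with $^{unk}D_{test}$. The trained KNN model is then calibrated for probability estimation of the unknown proportion using the scikit-learn implementation of Platt's scaling method (Platt & others, 1999) with $^{unk}D_{validation}$. The proportion of unknown sources in $S_i$, $p_u \in [0, 1]$, is then estimated using this trained and calibrated KNN model. Ultimately, this process is repeated independently for each sink sample $S_i$ of $D_{sink}$.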
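The construction of the synthetic unknown samples and the Bray-Curtis dissimilarity can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not Sourcepredict's implementation (which relies on scikit-bio): the function names, the organism identifiers, the counts, and the value of `alpha` are hypothetical, and the value 0.01 of the Gaussian is read here as a variance.

```python
import random


def make_unknown_samples(sink, n_unknown, alpha, variance=0.01, seed=None):
    """Derive `n_unknown` synthetic samples from one sink sample.

    `sink` maps organism identifiers to normalized counts. Each synthetic
    count is alpha * x, with x drawn from a Gaussian centred on the sink count.
    """
    rng = random.Random(seed)
    sd = variance ** 0.5  # assumption: 0.01 is interpreted as a variance
    return [
        {org: alpha * rng.gauss(count, sd) for org, count in sink.items()}
        for _ in range(n_unknown)
    ]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples given as dicts."""
    organisms = set(u) | set(v)
    numerator = sum(abs(u.get(o, 0.0) - v.get(o, 0.0)) for o in organisms)
    denominator = sum(u.get(o, 0.0) + v.get(o, 0.0) for o in organisms)
    return numerator / denominator


# Made-up sink sample and an illustrative alpha.
sink = {"taxid_A": 120.0, "taxid_B": 30.0, "taxid_C": 5.0}
unknowns = make_unknown_samples(sink, n_unknown=3, alpha=0.1, seed=42)
distances = [bray_curtis(sink, u) for u in unknowns]
```

Because each synthetic sample is a down-scaled, slightly perturbed copy of the sink sample, its dissimilarity to the sink stays close to $(1 - \alpha)/(1 + \alpha)$, the value obtained without noise.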

Prediction of the proportion of known sources
First, only organism TAXIDs corresponding to the species taxonomic level are retained using the ETE toolkit (Huerta-Cepas, Serra, & Bork, 2016). A weighted UniFrac (default) (Lozupone, Hamady, Kelley, & Knight, 2007) pairwise distance matrix is then computed on the merged and normalized training dataset $D_{ref}$ and test dataset $D_{sink}$ with scikit-bio, using the NCBI taxonomy as a reference tree. This distance matrix is then embedded in two dimensions (default) using the scikit-learn implementation of t-SNE (Maaten & Hinton, 2008). The two-dimensional embedding is then split back into the training dataset $^{tsne}D_{ref}$ and the testing dataset $^{tsne}D_{sink}$.

The training dataset $^{tsne}D_{ref}$ is further divided into three subsets: $^{tsne}D_{train}$ (64%), $^{tsne}D_{test}$ (20%), and $^{tsne}D_{validation}$ (16%). The KNN algorithm is then trained on $^{tsne}D_{train}$, with a five-fold (default) cross-validation to search for the optimal number of neighbors K, and its accuracy is computed with $^{tsne}D_{test}$. Finally, this second trained KNN model is also calibrated for source proportion estimation using the scikit-learn implementation of Platt's scaling method with $^{tsne}D_{validation}$. The proportion $p_{c_s} \in [0, 1]$ of each of the $n_s$ sources $c_s \in \{c_1, .., c_{n_s}\}$ in each sample $S_i$ is then estimated using this second trained and calibrated KNN model.
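The partition of the embedded reference samples and the way a KNN model turns neighbor votes into source proportions can be sketched as follows. Sourcepredict relies on scikit-learn for these steps; the pure-Python functions below are hypothetical stand-ins written for illustration. The 64/20/16 partition is obtained here, as an assumption, by two successive 80/20 splits, and both the cross-validated search for K and Platt's scaling are left out.

```python
import math
import random
from collections import Counter


def three_way_split(items, seed=0):
    """Shuffle `items` and return (train, test, validation) = 64/20/16 %."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(0.2 * len(shuffled))       # 20% held out for testing
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_validation = round(0.2 * len(rest))     # 20% of the remaining 80% = 16%
    validation, train = rest[:n_validation], rest[n_validation:]
    return train, test, validation


def knn_proportions(train_points, train_labels, query, k):
    """Fraction of the k nearest training points that belong to each source."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return {label: votes[label] / k for label in sorted(set(train_labels))}


# Made-up two-dimensional embedding with two sources.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = ["soil", "soil", "soil", "human", "human"]
proportions = knn_proportions(points, labels, query=(5, 5.5), k=3)
```

The uncalibrated vote fractions returned here play the role of the proportions $p_{c_s}$; in the method described above they are additionally calibrated on the validation subset.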

Combining unknown and source proportions
For each sample $S_i$ of the test dataset $D_{sink}$, the predicted unknown proportion $p_u$ is then combined with the predicted proportion $p_{c_s}$ for each of the $n_s$ sources $c_s$ of the training dataset such that $\sum_{c_s=1}^{n_s} s_c + p_u = 1$, where $s_c = p_{c_s} \cdot (1 - p_u)$, i.e., the known-source proportions are rescaled to the fraction of the sample that is not attributed to unknown sources. Finally, a summary table gathering the estimated source proportions is returned as a CSV file, along with the t-SNE embedding sample coordinates.
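The combination step can be illustrated with a few lines of Python. This sketch assumes that the known-source proportions of a sample sum to 1 and are rescaled by $1 - p_u$, so that all reported proportions add up to 1; the function name `combine_proportions`, the source names, and the numerical values are hypothetical.

```python
def combine_proportions(p_unknown, p_sources):
    """Merge the unknown proportion with the known-source proportions.

    `p_sources` maps each source to its predicted proportion (summing to 1).
    Each known source is rescaled by the non-unknown fraction 1 - p_unknown.
    """
    if not 0.0 <= p_unknown <= 1.0:
        raise ValueError("p_unknown must lie in [0, 1]")
    known_fraction = 1.0 - p_unknown
    combined = {source: p * known_fraction for source, p in p_sources.items()}
    combined["unknown"] = p_unknown
    return combined


# Made-up predictions for one sink sample.
example = combine_proportions(0.2, {"dog": 0.5, "human": 0.3, "soil": 0.2})
```

With these made-up values, 20% of the sample is attributed to unknown sources and the remaining 80% is shared among the three known sources in the ratio predicted by the second model.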