phylosmith: an R-package for reproducible and efficient microbiome analysis with phyloseq-objects

This paper presents phylosmith, an R-package that enables reproducible and efficient analysis of microbiome data stored in phyloseq-class objects. phylosmith uses the standardized data format of phyloseq and R object accession methods to provide functions with simple and intuitive input arguments. The functions provided in phylosmith have been divided into 3 categories.


Graphs
The graphs are designed to serve as a quick and easy way to visualise data for analysis and to provide a foundation for publication figures. Graphics include ordinations, phylogeny profiles, and co-occurrence networks. All images are produced as ggplot objects (Wickham, 2016), so an image can be altered and additional layers added to tailor the graphic as desired. Additionally, the code for producing the graphs is readily accessible, so it can be reused and adapted, providing a foundation to start from. The most novel feature, for the field of microbiome research, is the implementation of a t-SNE ordination. Most studies have used PCA or NMDS, which can suffer from converging to a local minimum on large datasets; t-SNE is designed for large datasets and is not susceptible to these same limitations.

Calculations
As of publication of this paper, the functions in this section all pertain to calculating and analyzing the Spearman rank co-occurrence. The routine was written in efficient C++ code and interfaced with R using the Rcpp API (Eddelbuettel et al., 2011). The resulting co-occurrence table matches that produced by the cor() function in the R stats package, but is calculated much faster on a single thread, and a multi-threaded option is implemented as well.
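The equivalence with cor() can be illustrated in base R. The following is a minimal sketch, not the phylosmith implementation: it uses a made-up count table (the object names counts, ranks, rho_manual, and rho_stats are assumptions for illustration) and relies on the fact that Spearman's rho is Pearson's correlation computed on ranks.

```r
# Toy count table: rows are samples, columns are taxa (made-up values).
counts <- matrix(
  c(10, 0, 3, 25,
     7, 1, 5, 30,
     0, 4, 9, 12,
     2, 8, 6,  0,
    15, 2, 1, 40),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("sample_", 1:5), paste0("taxon_", 1:4))
)

# Spearman's rho is Pearson's r computed on the within-taxon ranks
# (ties receive average ranks, as in stats::cor).
ranks <- apply(counts, 2, rank, ties.method = "average")
rho_manual <- cor(ranks, method = "pearson")

# Reference result from the stats package.
rho_stats <- cor(counts, method = "spearman")

# Both routes give the same taxon-by-taxon correlation matrix.
agrees <- isTRUE(all.equal(rho_manual, rho_stats))
print(agrees)
```

Any faster routine, compiled or multi-threaded, can be checked against cor() in the same way on a small table before being applied to a full dataset.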

Need
Adoption of data-standards makes data readily available for sharing and enables the creation and implementation of tools for reproducible research. It is commonly said that in the age of big-data, biologists are required to have computational proficiency and literacy (Carey, 2018). It seems reasonable, then, that a large onus should fall on bioinformaticians to create accessible and practical tools that enable biologists.
For the field of microbiome research, a formulaic approach to analysis has become commonplace. A generic study will incorporate some combination of the same figures: ordination, profile bar-chart, heatmap, network, etc. (Huttenhower et al., 2012; Turnbaugh et al., 2009; Arumugam et al., 2011). Each new scientist, often from a biology, microbiology, ecology, or environmental science background, is required to learn how to produce these analyses and figures. A lot of time is spent learning how to generate these plots. Even more time is spent learning how to process data, which can easily be done incorrectly without the error being obvious (e.g., incorrect logical subsetting, factor levels set incorrectly, or even reordering of samples due to string sorting methods), leading to incorrect results and conclusions.
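Two of these silent errors are easy to reproduce in base R. The sketch below uses made-up sample identifiers and depths (all object names are assumptions for illustration) to show how string sorting reorders samples and how a factor can be converted to the wrong numbers without any warning.

```r
# String sorting compares character by character, so "S10" sorts before "S2".
sample_ids <- c("S1", "S2", "S10", "S11")
sorted_ids <- sort(sample_ids, method = "radix")  # radix sorts in the C locale
print(sorted_ids)
# [1] "S1"  "S10" "S11" "S2"

# A numeric variable that was read in as text and stored as a factor.
# The levels are ordered as strings: "10", "20", "5".
depth_cm <- factor(c("10", "5", "20"))

# as.numeric() on a factor returns the internal level codes, not the values.
wrong <- as.numeric(depth_cm)
print(wrong)
# [1] 1 3 2

# Converting through character recovers the intended values.
right <- as.numeric(as.character(depth_cm))
print(right)
# [1] 10  5 20
```

Neither mistake raises an error or a warning, which is why such problems can propagate unnoticed into figures and statistics.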
For microbiome researchers using the R statistical programming language (R Core Team, 2014), a data-standard has been available in phyloseq (McMurdie & Holmes, 2013). phyloseq provides an S4-class object that contains a count table, taxa table, and associated metadata, along with a phylogenetic tree slot and reference sequence slot. For beginning and intermediate users of R, S4 objects can be a barrier, as they require an additional layer of accession methods compared to the base S3 objects. phyloseq offers several functions for handling its objects, as well as functions for producing some common figures, but is by no means a complete toolset. Additionally, when the authors originally wrote phyloseq, advanced tools such as the data.table package (Dowle et al., 2014) were not practically available and thus had not been implemented within the program.
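The extra layer of accession can be seen with a toy S4 class. The sketch below is an assumption-laden illustration in base R: the class ToyMicrobiome, its slots, and the accessor treatments() are hypothetical and far simpler than the real phyloseq class, which supplies its own accessors such as otu_table() and sample_data().

```r
library(methods)  # attached by default in interactive R; explicit for Rscript

# Hypothetical, simplified stand-in for a microbiome container (not phyloseq).
setClass("ToyMicrobiome",
         representation(counts = "matrix", metadata = "data.frame"))

toy <- new("ToyMicrobiome",
  counts = matrix(1:6, nrow = 2,
                  dimnames = list(c("taxon_a", "taxon_b"), c("s1", "s2", "s3"))),
  metadata = data.frame(treatment = c("control", "treated", "treated"),
                        row.names = c("s1", "s2", "s3"),
                        stringsAsFactors = FALSE)
)

# With a plain list (the S3 style many users know), `$` is all that is needed.
as_list <- list(counts = toy@counts, metadata = toy@metadata)
via_dollar <- as_list$metadata$treatment

# With an S4 object, slots are reached with `@` or slot() before `$` applies.
via_at   <- toy@metadata$treatment
via_slot <- slot(toy, "metadata")$treatment

# Packages usually hide the slots behind accessor functions like this one.
setGeneric("treatments", function(object) standardGeneric("treatments"))
setMethod("treatments", "ToyMicrobiome",
          function(object) object@metadata$treatment)
via_accessor <- treatments(toy)

print(slotNames(toy))
# [1] "counts"   "metadata"
```

All four routes return the same vector; the difference is how much the user has to know about the internal structure of the object to get there.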
Providing tools for reproducible and efficient research can help microbiome researchers focus more effort on answering biological questions. Providing simple implementations of tools, such as t-SNE (Maaten & Hinton, 2008), can increase the acceptance and adoption of new techniques in a field that is hesitant to adopt them. The value of these tools to science and understanding as a whole should not be overlooked.