rTASSEL: An R interface to TASSEL for analyzing genomic diversity

The need for efficient tools and applications for analyzing genomic diversity is essential for any genetics research or breeding program. One commonly used tool, TASSEL (Trait Analysis by aSSociation, Evolution, and Linkage), provides many core methods for genomic analyses. Despite its efficiency, TASSEL has limited automation potential for reproducible research and to interact with other analytical tools. Here we present an R package, rTASSEL, that is a front-end to connect to a variety of highly used TASSEL methods and analytical tools. The goal of this package is to create a unified scripting workflow that leverages the analytical prowess of TASSEL, in conjunction with R’s data handling and visualization capabilities, without ever having the user switch between these two environments.


Summary
The need for efficient tools and applications for analyzing genomic diversity is essential for any genetics research or breeding program. One commonly used tool, TASSEL (Trait Analysis by aSSociation, Evolution, and Linkage), provides many core methods for genomic analyses. Despite its efficiency, TASSEL has limited automation potential for reproducible research and to interact with other analytical tools. Here we present an R package, rTASSEL, that is a front-end to connect to a variety of highly used TASSEL methods and analytical tools. The goal of this package is to create a unified scripting workflow that leverages the analytical prowess of TASSEL, in conjunction with R's data handling and visualization capabilities, without ever having the user switch between these two environments.

Statement of need
As breakthroughs in genotyping technologies allow for increasing available variant resources, methods and implementations to analyze complex traits in a diverse array of organisms are needed. One such resource is TASSEL (Trait Analysis by aSSociation, Evolution, and Linkage). This software suite contains functionality for analyses in association studies, linkage disequilibrium (LD), kinship, and dimensionality reduction (e.g., PCA and MDS) (Bradbury et al., 2007). While initially released in 2001, the fifth version, TASSEL 5, has been optimized for handling large data sets and has added newer approaches to association analyses for many thousands of traits (Shabalin, 2012). Despite these improvements, interacting with TASSEL has been limited to either a graphical user interface with limited workflow reproducibility or a command-line interface with a higher learning curve that can dissuade novice researchers and provide unnecessary intermediate files in an analytics workflow (Zhang et al., 2009). To remediate this issue, we have created an R package, rTASSEL. This package interfaces the analytical power of TASSEL with R's data formats and intuitive function handling. Approach Implementation Figure 1: Overview of the rTASSEL workflow. Genotypic and phenotypic data (A) are used to create an R S4 object (B). From this object, TASSEL functionalities can be called to run various association, linkage disequilibrium, and relatedness functions (C). Outputs from these TASSEL analyses are returned to the R environment as data frame objects (D), Manhattan plot visualizations (E), or interactive visualizations for linkage disequilibrium analysis (F). rTASSEL combines TASSEL's abilities to store genotype data as half bytes, bitwise arithmetic for kinship analyses, genotype filtration, extensive forms of linear modeling, multithreading, and access to a range of native libraries while providing access to R's prominent scripting capabilities and commonly used Bioconductor classes (Gentleman et al., 2004;Lawrence et al., 2013;Morgan et al., 2021). Since TASSEL is written in Java, a Java to R interface is implemented via the rJava package (Urbanek, 2021). rTASSEL allows for the rapid import, analysis, visualization, and export of various genomic data structures. Diverse formats of genotypic information can be used as inputs for rTASSEL. These include variant call format (.vcf), HapMap (.hmp.txt), and Flapjack (.flpjk.*). Phenotype data can also be supplied in multiple formats. These include TASSEL formatted data sets or R data frame objects ( Figure 1A).
Once data is imported, the function readGenotypePhenotype is used to construct an S4 object, which is used for all downstream analyses ( Figure 1B, Figure 1C). This object contains slots that exclusively hold references to objects held in the Java virtual machine, which can be called with downstream functions. Prior to analysis, genotype objects can be quickly imported and filtered in several ways to help in the reduction of confounding errors. rTASSEL can filter genotype objects by either variant site properties (filterGenotypeTableSites) or by individuals (filterGenotypeTableTaxa).

Association functions
One of TASSEL's most dynamic functionalities is its capability to perform various association modeling techniques. rTASSEL allows several types of association studies to be conducted using one primary function, assocModelFitter, with different parameter inputs. This allows for implementing both least-squares fixed-effects general linear models (GLM) and mixed linear models (MLM) via the Q + K method (Yu et al., 2006). If no genotypic data is provided to the GLM model, assocModelFitter can calculate best linear unbiased estimates (BLUEs). Additionally, fast GLM approaches are implemented in rTASSEL, which allow for the rapid analysis of many phenotypic traits (Shabalin, 2012).
Linear models can be specified following the format used by R's lm function: where y is phenotype data, and A n is any covariate or factor data. This formula parameter and several other parameters allow the user to run BLUE, GLM, or MLM modeling. Once association analysis is completed, TASSEL table reports of association statistics are generated as an R list which can then be exported as flat files or converted to data frames ( Figure 1D). rTASSEL can also visualize association statistics with the function, manhattanPlot, which utilizes the graphical capabilities of the package, ggplot2 (Wickham, 2016) ( Figure 1E).

Linkage disequilibrium
rTASSEL can also generate linkage disequilibrium (LD) from genotype data via the function linkageDiseq. LD is estimated by the standardized disequilibrium coefficient, D , correlation between alleles at two loci (r 2 ), and subsequent p-values via a two-sided Fisher's exact test. TASSEL table reports for all pairwise comparisons are generated as data.frame objects, and heatmap visualizations for each given metric are generated via TASSEL's legacy LD Java viewer or ggplot2 ( Figure 1F).

Relatedness functions
For users to run MLM methods, relatedness estimates need to be calculated. rTASSEL can efficiently compute this on large data sets by processing blocks of sites at a time using bitwise operations. This can be accomplished using the function kinshipMatrix, which will generate a kinship matrix from genotype data. Several methods for calculating kinship in TASSEL are implemented. By default, a "centered" identity by state (IBS) approach is used (Endelman & Jannink, 2012). Additionally, normalized IBS (Yang et al., 2011), dominance-centered IBS (Muñoz et al., 2014), and dominance normalized IBS (Zhu et al., 2015) can be used. rTASSEL can either generate a reference object for association analysis or an R matrix object via R's as.matrix function for additional analyses. In addition to kinship generation, principal components analysis and multidimensional scaling can be used on genotype data using rTASSEL methods, pca and mds, respectively. Finally, phylogenetic analysis can be performed on genotype data using the createTree method which will generate phylo objects commonly used by the ape package (Paradis & Schliep, 2019). The createTree method allows for two clustering methods: neighbor joining or UPGMA (unweighted pair group method with arithmetic mean).

Genomic prediction
The function genomicPrediction can be used for predicting phenotypes from genotypes. To do this, genomicPrediction uses genomic best linear unbiased predictors (gBLUPs). It proceeds by fitting a mixed model that uses kinship to capture covariance between taxa. The mixed model can calculate BLUPs for taxa that do not have phenotypes based on the phenotypes of lines with relationship information.

Additional resources
More information about various functionalities and workflows can be found on our project webpage. Source code can be found on our GitHub repository. An interactive Jupyter notebook session detailing additional rTASSEL workflows can be found on Binder.