TAXPASTA: TAXonomic Profile Aggregation and STAndardisation

1 Unseen Bio ApS, Copenhagen, Denmark
2 Microbiome Sciences Group, Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
3 Associated Research Group of Archaeogenetics, Leibniz Institute for Natural Product Research and Infection Biology Hans Knöll Institute, Jena, Germany
4 Department of Microbiology, Tumor and Cell Biology, Karolinska Institute, Solna, Sweden
5 Department of Paleobiotechnology, Leibniz Institute for Natural Product Research and Infection Biology Hans Knöll Institute, Jena, Germany
¶ Corresponding author
DOI: 10.21105/joss.05627


Summary
Metagenomic analysis is largely concerned with untargeted genetic characterisation of the taxonomic and functional composition of whole communities of organisms. Researchers use metagenomic sequencing to ask questions such as 'who is present?' (which organisms are present) and 'what are they doing?' (which functions they are performing). The nature of this field is such that it intersects with ecology, medicine, statistics, and bioinformatics. Facilitated by the development of Next-Generation Sequencing (NGS), the field often generates large datasets consisting of many samples (hundreds) and many sequencing reads (tens of millions).
Partly because of the interdisciplinary nature of the field, but more importantly because no gold standard exists, accurately identifying the taxonomic origin of each sequencing read remains a popular and unresolved bioinformatics problem. Furthermore, the sizes of the datasets present interesting challenges for computational efficiency, which may require trading off accuracy for speed and memory use. Thus, a diverse range of bioinformatics tools exists to analyse metagenomic sequencing data and produce metagenomic profiles. However, most of those tools have invented their own (often tabular) result formats, which complicates downstream analysis and, in particular, comparison across tools.
TAXPASTA is a standalone command-line tool written in Python (Van Rossum & Drake Jr, 1995) that aims to standardise the diverse range of metagenomic profiler output formats into simple tabular formats that are readily consumed by downstream applications. TAXPASTA facilitates cross-comparison between taxonomic profiling tools without the external or dedicated modules and plugins that other 'dedicated' metagenomic profile formats require.

Statement of need
TAXPASTA is a Python package for standardising and aggregating metagenomic profiles coming from a wide range of tools and databases (Figure 1). It was developed as part of the nf-core/taxprofiler pipeline within the nf-core community (Ewels et al., 2020).
Across profilers, relative abundances can be reported as read counts, fractions, or percentages, often alongside any number of additional columns with extra information. Taxa can be recorded using taxonomic identifiers, taxonomic names, and/or in some cases semicolon-separated taxonomic 'paths' (lineages). The layout also varies, from typical tables to 'indented' taxonomy trees such as those produced by the Kraken (Wood et al., 2019) family of profilers. Manually parsing these for comparison can be an arduous, error-prone task, with researchers often resorting to custom R (R Core Team, 2023) and Python scripting, or even manual correction in spreadsheet software.
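To make the parsing burden concrete, the following minimal sketch reduces a small, made-up Kraken-style report (tab-separated columns: percentage, clade reads, directly assigned reads, rank code, taxonomy identifier, indented name) to plain identifier and count pairs. It is not TAXPASTA's implementation: the function name `standardise_report`, the output keys `taxonomy_id` and `count`, and the choice to keep the directly assigned reads are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical Kraken-style report, invented for this example.
# Columns: percentage, clade reads, direct reads, rank code, taxonomy ID, name.
REPORT = (
    "50.00\t50\t50\tU\t0\tunclassified\n"
    "50.00\t50\t0\tR\t1\troot\n"
    "50.00\t50\t5\tD\t2\t  Bacteria\n"
    "45.00\t45\t45\tS\t562\t    Escherichia coli\n"
)


def standardise_report(text):
    """Reduce a Kraken-style report to rows of taxonomy ID and integer count.

    Assumption: the count kept here is the number of reads assigned directly
    to each taxon (third column), so that counts add up to the total reads.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        _percent, _clade_reads, direct_reads, _rank, tax_id, _name = fields
        rows.append({"taxonomy_id": int(tax_id), "count": int(direct_reads)})
    return rows


profile = standardise_report(REPORT)
total_reads = sum(row["count"] for row in profile)
```

Every profiler with a different layout would need its own variant of such a function, which is the kind of repeated effort a shared converter is meant to remove.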
With TAXPASTA, all of those formats can be converted into a single, standardised output that, at a minimum, contains taxonomic identifiers and their relative abundances as integer counts. It can also be used to aggregate profiles across samples from the same profiler and merge them into a single, standardised table. Having a single format facilitates downstream analyses and comparisons. TAXPASTA is not the first tool to attempt standardising metagenomic profiles, but it is by far the most comprehensive in terms of supported profilers and output formats.
An initiative already exists to benchmark and compare profilers, as well as to provide guidance on their fitness for purpose: the Critical Assessment of Metagenome Interpretation (CAMI) challenges (Meyer et al., 2022; Sczyrba et al., 2017). For that initiative, the Open-community Profiling Assessment tooL (OPAL) (Meyer et al., 2019) was developed. Creating a community-wide assessment faced many of the challenges presented here; however, the chosen solution was to mandate a single output format for all profilers participating in the challenge. Furthermore, OPAL is an integrated tool performing assessment and visualisation, whereas TAXPASTA follows the UNIX philosophy of doing one thing and doing it well.
The BIOM format (McDonald et al., 2012) was created with a similar intention of standardising a storage format for microbiome analyses. However, transforming metagenomic profiles into that format is left entirely up to the user. The format is also not easily loadable into spreadsheet software, and external libraries are required to load it into data analysis languages such as R.
The QIIME™2 next-generation microbiome bioinformatics platform (Bolyen et al., 2019) also maintains internally consistent formats for storing and processing metagenomic data that new tools can plug into. However, this suite of software was originally designed for the analysis of 16S rRNA amplicon sequencing data (Caporaso et al., 2010), and whole-genome, shotgun metagenomic sequencing data is only supported via community plugins. While some of the taxonomic profilers also come with scripts to convert their output into another format and/or merge multiple profiles into a single table, such as the KrakenTools companion package (Lu et al., 2022), these are often focused on the specific tool or family of tools. Thus, users would have to become proficient in yet another piece of software per tool or family of tools for the sake of consistent output files.
For maximum compatibility, TAXPASTA offers a wide range of familiar output file formats, such as text-based, tabular formats (CSV, TSV), spreadsheets (ODS, XLSX), optimised binary formats (Apache Arrow and Parquet), and the HDF5-based BIOM format (McDonald et al., 2012). We hope that this will let researchers plug and play TAXPASTA into their existing analysis workflows in a wide range of settings.
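As a rough illustration of what a merged table in one of the text-based formats could look like, the sketch below combines two invented per-sample profiles into a wide TSV table with one row per taxon and one column per sample. It uses only the Python standard library and does not call TAXPASTA. The sample names, the `taxonomy_id` column header, the helper names, and the decision to fill taxa missing from a sample with zero are all assumptions made for this example.

```python
import csv
import io

# Hypothetical standardised profiles: taxonomy ID mapped to integer count.
samples = {
    "sample_a": {562: 45, 2: 5},
    "sample_b": {562: 10, 1280: 30},
}


def merge_profiles(profiles):
    """Build a wide table with one row per taxon and one column per sample.

    Assumption: a taxon that is absent from a sample gets a count of zero.
    """
    tax_ids = sorted({tax_id for counts in profiles.values() for tax_id in counts})
    names = list(profiles)
    header = ["taxonomy_id", *names]
    rows = [
        [tax_id, *(profiles[name].get(tax_id, 0) for name in names)]
        for tax_id in tax_ids
    ]
    return header, rows


def to_tsv(header, rows):
    """Serialise the table as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header, rows = merge_profiles(samples)
tsv_text = to_tsv(header, rows)
```

A table of this shape can be opened directly in spreadsheet software or read with the built-in tabular readers of R and Python, which is the practical benefit of the plain-text output options.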