ReferenceSeeker: rapid determination of appropriate reference genomes

Summary The large and growing number of microbial genomes available in public databases makes the optimal selection of reference genomes necessary for many in-silico analyses, e.g. single nucleotide polymorphism detection, scaffolding and comparative genomics, increasingly difficult. Here, we present ReferenceSeeker, a novel command line tool combining a fast kmer profile-based database lookup of candidate reference genomes with subsequent calculation of highly specific average nucleotide identity (ANI) values for the rapid determination of appropriate reference genomes. Pre-built databases for bacteria, archaea, fungi, protozoa and viruses based on the RefSeq database are provided for download. Availability and Implementation ReferenceSeeker is open source software implemented in Python. Source code and binaries are freely available for download at https://github.com/oschwengers/referenceseeker under the GNU GPL3 license. Contact referenceseeker@computational.bio


Introduction
The enormous success and ubiquitous application of next and third generation sequencing has led to a large number of available high-quality draft and complete microbial genomes in the public databases. Today, the NCBI RefSeq database release 90 alone contains 11,060 complete bacterial genomes (Haft et al. 2018) . Concurrently, selection of appropriate reference genomes (RGs) is increasingly important as it has enormous implications for routine in-silico analyses, as for example in detection of single nucleotide polymorphisms, scaffolding of draft assemblies, comparative genomics and metagenomic tasks. Therefore, a rigorously selected RG is a prerequisite for the accurate and successful application of the aforementioned bioinformatic analyses. In order to address this issue several new databases, methods and tools have been published in recent years e.g. RefSeq, DNA-DNA hybridization (Meier-Kolthoff et al. 2013) , average nucleotide identity (ANI) values (Goris et al. 2007) and Mash (Ondov et al. 2016) . Nevertheless, the sheer amount of currently available databases and potential RGs contained therein, together with the plethora of tools available, often require manual selection of the most suitable RGs. To the best of the authors' knowledge, there is currently no such tool providing both an integrated, highly specific workflow and scalable and rapid implementation. ReferenceSeeker was designed to overcome this bottleneck. As a novel command line tool, it combines a fast kmer profile-based lookup of candidate reference genomes (CRGs) from high quality databases with rapid computation of highly specific ANI and conserved DNA values.

Implementation
ReferenceSeeker is a linux command line tool implemented in Python 3. All necessary external binaries are bundled with the software. The tool Itself requires no external dependencies other than Biopython for file input and output.

Databases
ReferenceSeeker takes advantage of taxon-specific custom databases in order to reduce data size and overall runtime. Pre-built databases for the taxonomic groups bacteria, archaea, fungi, protozoa and viruses are provided. Each database integrates genomic as well as taxonomic information comprising genome sequences of all RefSeq genomes with an assembly level 'complete' or whose RefSeq category is either denoted as 'reference genome' or 'representative genome', as well as kmer profiles, related species names, NCBI Taxonomy identifiers and RefSeq assembly identifiers. For convenient and fully automatic updates, we provide locally executable scripts implemented in bash and Nextflow (Di Tommaso et al. 2017) .

Database Lookup of CRGs
To reduce the number of necessary ANI calculations a kmer profile-based lookup of CRGs against custom databases is carried out. This step is implemented via Mash parameterized with a Mash distance of 0.1, which was shown to correlate well with an ANI of roughly 90% (Ondov et al. 2016) and thereby establishing a lower limit for reasonably related genomes. The resulting set of CRGs is subsequently reduced to a configurable number of CRGs (default=100) with the lowest Mash distances.
Determination of RG As a highly specific measure for microbial genome relationships ReferenceSeeker uses ANI and conserved DNA (conDNA) values. The reasoning for the subsequent calculation of ANI and conDNA values is that although Mash distance values correlate well with ANI values, the same cannot be observed for conDNA values (Fig. 1).
Calculation of these metrics is implemented vía Nucmer contained in the MUMmer package (Marçais et al. 2018) as it was recently shown that Nucmer based implementations (ANIn) compare favourably against BLAST+ based implementations (ANIb) in terms of runtime. Given that compared genomes are closely related, i.e. shared ANI is above 90%, it was also shown that ANIn correlates well with ANIb (Yoon et al. 2017) . This is ensured by the prior Mash-based selection of CRGs. As an established threshold for species boundaries (Goris et al. 2007) , results are subsequently filtered by configurable ANI and conDNA values with a default of 95% and 69%, respectively. Finally, CRGs are sorted according to the harmonic mean of ANI and conDNA values in order to incorporate both nucleotide identity and genome coverage of query genomes and resulting CRGs. In this manner, ReferenceSeeker ensures that the resulting RGs sufficiently reflect the genomic landscape of a query genome.

Application
ReferenceSeeker takes as input a microbial genome assembly in fasta format and the path to a taxonomic database of choice. Results are returned as a tabular separated list comprising the following information: RefSeq assembly identifier, ANI, conDNA, NCBI taxonomy identifier, assembly status and organism name. To illustrate the broad applicability at different scales we tested ReferenceSeeker with 12 microbial genomes from different taxonomic groups and measured overall runtimes on a common consumer laptop providing 4 cores and a server providing 64 cores (Table 1). For the tested bacterial genomes, ReferenceSeeker limited the number of resulting RGs to a default maximum of 100 genomes. Runtimes of archaeal and viral genomes are significantly shorter due to a small number of available RGs in the database and overall smaller genome sizes, respectively.