covtobed: a simple and fast tool to extract coverage tracks from BAM files

A common task in bioinformatics is the mapping of DNA sequencing reads (produced by “next generation sequencing” experiments) against a reference genome. The output of the alignment is commonly encoded in a BAM file (Li et al., 2009). For several applications of DNA sequencing it is useful to extract the depth of coverage (Sims, Sudbery, Ilott, Heger, & Ponting, 2014) at specific positions in the BAM file, encoding the output in the standard BED format (Quinlan & Hall, 2010).


Summary
A common task in bioinformatics is the mapping of DNA sequencing reads (produced by "next generation sequencing" experiments) against a reference genome. The output of the alignment is commonly encoded in a BAM file (Li et al., 2009). For several applications of DNA sequencing it is useful to extract the depth of coverage (Sims, Sudbery, Ilott, Heger, & Ponting, 2014) at specific positions in the BAM file, encoding the output in the standard BED format (Quinlan & Hall, 2010).
Here we describe covtobed, a C++ program designed to extract the depth of coverage per position from a sorted BAM file, optionally specifying a range of coverage of interest and a minimum length for the features to be printed in the output BED file. Parsing of BAM files is performed using libbamtools (Barnett, Garrison, Quinlan, Strömberg, & Marth, 2011).
The design has been inspired by the UNIX programming philosophy (Wikipedia contributors, 2019), and thus covtobed performs a single task and supports input and output streams.

Availability and Installation
covtobed is distributed with MIT licence and available from the GitHub repository, and can be easily installed via Miniconda from the "bioconda" channel (i. e. conda install -c bi oconda covtobed).
The tool is also available as a Docker image downloadable from Docker Hub (i. e. docker pull andreatelatin/covtobed) or as a Singularity image.

Code (structure and dependencies)
The code is object oriented, including an Input class handling reading, parsing and filtering of alignments and an Output class handling coverage filtering and writing in different formats. The main algorithm is based on a priority_queue from the standard library and is both fast and memory efficient.

Documentation
The package documentation is maintained in the GitHub wiki. The documentation contains examples of usage, example of produced output and details about the package.

Example applications
When performing target enrichment experiments (where the aim is to sequence a set of selected regions of a genome), it's important to detect a lack of coverage or insufficient coverage (i.e. the coverage on target is lower than THRESHOLD). This information can be obtained by intersecting (using bedtools, (Quinlan & Hall, 2010)) the BED file describing the captured target regions (usually supplied by the company producing the kit) with the output of covtobed.
The tool has been used, for example, in the setup of a target enrichment panel targeting 71 human genes (Poloni et al., 2019), in order to detect uncovered regions.
While a tool exists -called mosdepth (Pedersen & Quinlan, 2018) -to perform a coverage analysis, covtobed was designed with the ability to quickly extract regions between userdefined coverage intervals and, more importantly, with streaming from standard input and to standard output, that Mosdepth doesn't support. covtobed is available both for Linux and macOS, while mosdepth is only available for Linux, and this makes covtobed a suitable building block for diverse pipelines (e. g. microbial genomics requires lesser resources and it is not uncommon to perform complete analyses on a laptop).

Performance
covtobed is a fast tool, constantly outperforming the popular bedtools and providing comparable speed with mosdepth. With some datasets, like "gene panels", covtobed is more than ten times faster than mosdepth.
The scripts to perform the benchmark are available in the github repository.