pycoQC, interactive quality control for Oxford Nanopore Sequencing

Nanopore sequencing of nucleic acids took nearly 30 years to develop and is now firmly established as an alternative to sequencing-by-synthesis methods (Deamer, Akeson, & Branton, 2016). Oxford Nanopore Technologies (ONT) released the first commercial nanopore device for DNA sequencing in 2014 and has continually improved the technology since then (Jain, Olsen, Paten, & Akeson, 2016). Although the read accuracy is only around 90%, ONT technology can sequence very long molecules and generates data in real time. In addition, RNA can be sequenced directly and modified bases can be detected (Garalde et al., 2018).

The electrical signal acquired by the array of nanopores is stored in HDF5 format, with one file (called FAST5) per molecule sequenced. The signal is then converted into a nucleic acid sequence using basecalling software. There are several alternatives, but the best performers for read accuracy are Albacore and Guppy, both developed and maintained by ONT (Wick, Judd, & Holt, 2018). Both can generate FASTQ files, FAST5 files containing basecalling information and a text summary file. Although ONT recently released best-practice guidelines for quality control analysis of sequencing runs (Oxford Nanopore Technologies, 2019), it did not provide a turnkey solution to explore the sequencing data quality in depth.

Principle and example output
Briefly, pycoQC imports, filters and preprocesses one or several summary files generated with one of the previously mentioned basecallers. Alternatively, the input file can also be generated with the companion program Fast5_to_seq_summary included with the package. If available, calibration strand and barcoding information are also extracted either from the summary file (Albacore) or from a separate barcoding summary file (Guppy).
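The import and filtering step can be illustrated with a minimal sketch that uses only the Python standard library. This is not pycoQC's implementation: the column names (`read_id`, `sequence_length_template`, `mean_qscore_template`), the `load_reads` helper and the quality threshold of 7 are assumptions made for illustration, and the example data are invented.

```python
import csv
import io

def load_reads(handle, min_qual=7.0):
    """Parse a tab-separated summary and keep reads passing simple filters.

    Column names and the default threshold are illustrative assumptions.
    """
    reads = []
    for row in csv.DictReader(handle, delimiter="\t"):
        length = int(row["sequence_length_template"])
        qual = float(row["mean_qscore_template"])
        # Discard zero-length reads and reads below the quality threshold
        if length > 0 and qual >= min_qual:
            reads.append({"read_id": row["read_id"], "length": length, "qual": qual})
    return reads

# Tiny in-memory stand-in for a summary file (invented values)
example = io.StringIO(
    "read_id\tsequence_length_template\tmean_qscore_template\n"
    "r1\t1200\t9.5\n"
    "r2\t0\t0.0\n"
    "r3\t560\t6.2\n"
    "r4\t8400\t11.1\n"
)
reads = load_reads(example)
print([r["read_id"] for r in reads])
```

With a real run, the `io.StringIO` object would be replaced by an open file handle on the summary file.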
Then, a range of plots and tables can be generated to explore the data. pycoQC plots are interactive, allowing users to display all the reads or only those above the quality threshold, to zoom in and to hide legend labels. The command line interface offers a simple and straightforward experience. On the other hand, the Python API for Jupyter notebook gives more flexibility to users who can easily customise and share their analyses.
Example static versions of a selection of the tables and plots produced by pycoQC are presented in Figures 1 to 4.
Figure 1) Sequencing run summary statistics obtained with the summary function. In addition to the overall run results, a breakdown per run ID is displayed. pycoQC counts the number of bases and reads sequenced as well as the number of active channels and the run duration. The median read length, median read quality and the N50 score are also computed.
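For readers unfamiliar with the N50 score, the sketch below shows one common way to compute it alongside the median read length. It assumes the usual definition (the read length L such that reads of length L or longer contain at least half of all sequenced bases); the `n50` helper is hypothetical and is not taken from pycoQC.

```python
from statistics import median

def n50(lengths):
    """Return the length L such that reads >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

lengths = [100, 200, 300, 400, 500]
print(median(lengths), n50(lengths))
```

Because long reads contribute more bases, the N50 is typically larger than the median read length, which is why both are worth reporting.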
Figure 2) 2D density plot of the read length compared with the mean read PHRED quality generated with the reads_len_qual_2D function. This visualisation offers a quick overview of read quality and length and allows the easy identification of read subpopulations.
Read length and mean quality can also be explored independently using the 1D density plot functions reads_len_1D and reads_qual_1D.
Figure 3) Read and base output over experiment time obtained with the output_over_time function. Both the cumulative and interval yields are displayed together with time points at which 50%, 75%, 90%, 99% and 100% of the reads/bases were sequenced. Users can also follow the evolution of read length and read quality with the len_over_time and qual_over_time functions.
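The time points marked on this kind of plot can be derived from the cumulative yield. The following is a minimal sketch under stated assumptions: reads are given as (start time, number of bases) pairs, and the `time_to_fraction` helper is hypothetical rather than part of pycoQC.

```python
def time_to_fraction(reads, fractions=(0.5, 0.75, 0.9, 0.99, 1.0)):
    """Map each fraction to the first time at which that share of bases was reached.

    reads: iterable of (start_time, n_bases) pairs.
    """
    ordered = sorted(reads)  # chronological order
    total = sum(n for _, n in ordered)
    pending = sorted(fractions)
    result = {}
    cumulative = 0
    for t, n in ordered:
        cumulative += n
        # A single read may cross several thresholds at once
        while pending and cumulative >= pending[0] * total:
            result[pending.pop(0)] = t
    return result

reads = [(0, 100), (10, 300), (20, 400), (30, 200)]
print(time_to_fraction(reads))
```

Passing 1 as the number of bases for every read gives the same time points for read counts instead of base yield.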

Figure 4) Yield over time per individual channel generated with the channels_activity function. Although the visualisation does not directly provide information about the flowcell layout, it gives a good overview of the heterogeneity of channel activity over the course of the run.
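The data behind a per-channel activity view can be approximated by binning base yield by channel and time interval. The sketch below is an illustration only: the input layout (channel, start time in seconds, number of bases), the one-hour bin width and the `channel_activity` helper are assumptions, not pycoQC's code.

```python
from collections import defaultdict

def channel_activity(reads, bin_seconds=3600):
    """Sum bases per channel and time bin.

    reads: iterable of (channel, start_time_seconds, n_bases).
    Returns {channel: {bin_index: bases}}.
    """
    activity = defaultdict(lambda: defaultdict(int))
    for channel, start, n_bases in reads:
        activity[channel][int(start // bin_seconds)] += n_bases
    # Convert nested defaultdicts to plain dicts for easier inspection
    return {ch: dict(bins) for ch, bins in activity.items()}

reads = [(1, 10, 500), (1, 4000, 700), (2, 20, 300), (1, 50, 200)]
grid = channel_activity(reads)
print(grid)
```

Channels that stop producing reads early simply have no entries for later bins, which is the heterogeneity the figure makes visible.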