RNAsik: A Pipeline for complete and reproducible RNA-seq analysis that runs anywhere with speed and ease

RNA sequencing (RNA-seq) is one of many applications of high throughput sequencing, in which millions of short sequence reads, typically around 100 bases long, are produced from RNA samples with the aim of characterising entire transcriptomes. In order to analyse RNA-seq data, multiple bioinformatics tools are collected together into a pipeline, in which each tool accepts processed data from the previous tool as its input. The RNAsik pipeline streamlines processing of RNA-seq data and facilitates the production of reproducible results. This pipeline can run standalone on workstations, cloud instances, or on High Performance Computing (HPC) clusters. A single RNAsik run gives a comprehensive overview of the experiment and produces output suitable for Differential Gene Expression (DGE) analysis.


Summary
With an alignment-based approach, RNAsik incorporates two main steps: 1) alignment of short reads from FASTQ files to a reference genome or transcriptome; and 2) counting the number of reads mapped to annotated genomic features (such as genes). The table of counts generated by RNAsik can be further analysed with any contemporary count-based DGE tool or package. One package particularly suited to the task is Degust (Powell, 2015), a powerful and user-friendly front-end tool for DGE data analysis, visualisation and exploration. Additionally, RNAsik can produce a number of quality control (QC) metrics, such as sequencing quality metrics reported using FastQC (Bioinformatics, 2011), intra- and inter-genic mapping rate estimation using QualiMap (Okonechnikov, Conesa, & García-Alcalde, 2016), and sequencing library size and GC bias estimation using Picard Tools (Broadinstitute, n.d.). These QC metrics are automatically summarised into a single, dynamic HTML report generated by MultiQC (Ewels, Magnusson, Lundin, & Käller, 2016).
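The counting step can be illustrated with a toy example. The sketch below is not how featureCounts or RNAsik is implemented; it is a minimal Python illustration, assuming 1-based inclusive coordinates, hypothetical gene names, and a simplifying rule that a read overlapping more than one gene is treated as ambiguous and left uncounted.

```python
def count_reads_per_gene(genes, reads):
    """Count aligned reads that overlap exactly one annotated gene.

    genes: dict mapping gene id -> (chromosome, start, end), 1-based inclusive
    reads: iterable of (chromosome, start, end) for each aligned read
    """
    counts = {gene_id: 0 for gene_id in genes}
    for chrom, start, end in reads:
        # Collect every gene on the same chromosome whose interval overlaps the read.
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        # Assumption for this sketch: ambiguous or unassigned reads are skipped.
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


# Hypothetical annotation and alignments, for illustration only.
genes = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 180, 300),
    "geneC": ("chr2", 50, 150),
}
reads = [
    ("chr1", 90, 110),   # overlaps geneA only
    ("chr1", 185, 195),  # overlaps geneA and geneB, so it is skipped
    ("chr1", 250, 280),  # overlaps geneB only
    ("chr2", 60, 100),   # overlaps geneC only
    ("chr2", 200, 250),  # overlaps no gene
    ("chr1", 150, 170),  # overlaps geneA only
]
print(count_reads_per_gene(genes, reads))
```

Repeating this per sample and placing the results side by side gives the genes-by-samples table of counts that count-based DGE tools take as input.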
Other features of the RNAsik pipeline include the ability to mark duplicated reads using Picard Tools (Broadinstitute, n.d.), sorting and indexing of alignments using Samtools (Li et al., 2009) to enable viewing in genome browser applications such as IGV (Robinson et al., 2011), an enhanced table of counts with additional meta-information about each gene (e.g. biotype and human-readable gene names), and ready-to-use coverage plots for every sample using bedtools2 (Quinlan & Hall, 2010) and UCSC tools (Raney et al., 2014). The RNAsik pipeline logs every step of processing, including the number of samples and associated FASTQ files, software tool versions and sequencing strand information.
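The idea of an enhanced table of counts can be sketched as a simple join between per-gene counts and gene annotation. The following Python sketch is an illustration under assumptions, not RNAsik's own code: the gene identifiers, gene name, sample names and column layout are all hypothetical.

```python
def annotate_counts(counts, annotation):
    """Add gene name and biotype columns to a per-gene table of counts.

    counts: dict mapping gene id -> dict of sample name -> read count
    annotation: dict mapping gene id -> dict with "gene_name" and "biotype"
    """
    rows = []
    for gene_id, per_sample in counts.items():
        # Genes missing from the annotation keep empty meta-information fields.
        meta = annotation.get(gene_id, {})
        row = {
            "gene_id": gene_id,
            "gene_name": meta.get("gene_name", ""),
            "biotype": meta.get("biotype", ""),
        }
        row.update(per_sample)
        rows.append(row)
    return rows


# Hypothetical counts and annotation, for illustration only.
counts = {
    "gene0001": {"sampleA": 120, "sampleB": 98},
    "gene0002": {"sampleA": 0, "sampleB": 7},
}
annotation = {
    "gene0001": {"gene_name": "Abc1", "biotype": "protein_coding"},
}

rows = annotate_counts(counts, annotation)
print("\t".join(rows[0].keys()))
for row in rows:
    print("\t".join(str(value) for value in row.values()))
```

Keeping the meta-information in the same table means that downstream filtering, for example by biotype, does not need a separate annotation lookup.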
RNAsik is written in BigDataScript (BDS) (Cingolani, Sladek, & Blanchette, 2015), a domain-specific language (DSL). BDS generates an additional HTML report alongside a typical RNAsik analysis. In addition to RNAsik's internal logging, this report holds system information such as run-time information and the exit status of every tool. RNAsik employs many other useful BDS features, such as built-in checkpointing for retries on failure and the ability to talk to an HPC cluster queue directly.
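Checkpointing for retries can be pictured as recording which steps have completed, so that a rerun after a failure resumes at the failed step. The Python sketch below illustrates that general idea only; it is not BDS syntax and does not reproduce how BDS stores its checkpoints. The function name, step names and JSON checkpoint file are assumptions made for the example.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_file):
    """Run named steps in order, skipping steps already recorded as complete.

    steps: list of (name, callable) pairs; a callable raises on failure
    checkpoint_file: path to a JSON file listing completed step names
    Returns the names of the steps executed during this call.
    """
    path = Path(checkpoint_file)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    executed = []
    for name, func in steps:
        if name in done:
            continue  # completed in an earlier run, so it is not repeated
        func()  # an exception here leaves earlier checkpoints on disk
        done.add(name)
        path.write_text(json.dumps(sorted(done)))
        executed.append(name)
    return executed
```

In this sketch, if a hypothetical `count` step fails after `align` has succeeded, calling `run_with_checkpoints` again with the same checkpoint file runs only `count`.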
RNAsik incorporates commonly used tools such as the STAR aligner (Dobin et al., 2013), featureCounts (Liao, Smyth, & Shi, 2014) and Samtools (Li et al., 2009). However, RNAsik can also be extended with new tools and features. Recently, RNAsik has been extended to include two other aligners, Hisat2 (Kim, Langmead, & Salzberg, 2015) and BWA-MEM (Li, 2013). This broadens the scope of RNAsik to bacterial RNA-seq analysis and widens the range of supported aligners. RNAsik is an open-source project under the Apache License 2.0, and contributions are welcome. In the near future, there are plans to extend RNAsik in several directions, including the incorporation of alignment-free read quantification and an RNA-seq variant calling option (Sun, Bhagwate, Prodduturi, Yang, & Kocher, 2016). RNAsik simplifies and speeds up RNA-seq analysis and automates many of the QC steps that are important but often overlooked.