CRBHits: From Conditional Reciprocal Best Hits to Codon Alignments and Ka/Ks in R

The Reciprocal Best Hit (RBH) approach is commonly used in bioinformatics to show that two sequences evolved from a common ancestral gene. In other words, RBH tries to find orthologous protein sequences within and between species. These orthologous sequences can be further analysed to evaluate protein family evolution, infer phylogenetic trees and to annotate protein function (Altenhoff et al., 2019). The initial sequence search step is classically performed with the Basic Local Alignment Search Tool (blast) (Altschul et al., 1990) and due to evolutionary constraints, in most cases protein coding sequences are compared between two species. Downstream analysis use the resulting RBH to cluster sequence pairs and build so-called orthologous groups like e.g. OrthoFinder (Emms & Kelly, 2015) and other tools.

: Overview of the two main pipeline function and its subtasks. cds2rbh(): from CDS to CRBHit pairs; rbh2kaks(): from CRBHit pairs to Ka/Ks values.

Coding sequence analysis and synteny
Calculating synonymous (Ks) and nonsynonymous substitutions (Ka) per orthologous sequence pair is a common task for evolutionary biologists, since its ratio Ka/Ks can be used as an indicator of selective pressure acting on a protein (Kryazhimskiy & Plotkin, 2008). However, this task is computational more demanding and consist of at least two steps, namely codon sequence alignment creation and Ka/Ks calculation. Further, the codon sequence alignment step consist of three subtasks, namely coding nucleotide to protein sequence translation, pairwise protein sequence alignment calculation and converting the protein sequence alignment back into a codon based alignment.
Downstream of CRBH creation, CRBHits features all above mentioned steps and subtasks. CRBHits has the ability to directly create codon alignments within R with the help of the widely used R package Biostrings (Pagès et al., 2017) (more than 200k downloads per year since 2014). These codon alignments can be subsequently used to calculate synonymous and nonsynonymous substitutions per sequence pair and is implemented in a multithreaded fashion either via the R package seqinr (Charif & Lobry, 2007) or the use of an R external tool KaKs_Calculator2.0 (Wang et al., 2010).
As gene duplication is one driving force in evolution (Ohno, 1970), the classification of genes as duplicates is one important step to provide us with insights into the molecular events responsible for the current genome architecture of species (Haas et al., 2004). New longread sequencing technology make more and more chromosome scale assemblies for model and non-model species available. The resulting chromosomal gene order information can be used with sequence similarity scores to classify genes into different types of duplication events, like tandem duplicates or chromosomal segments (syntenic regions) derived from e.g. wholegenome duplication. CRBHits features this classification step via the integration of the R external tool DAGchainer (Haas et al., 2004) and offers the possibility to directly link it with evolutionary rate estimations (see Figure 3).

Implementation
Like shmlast, CRBHits benefits from the blast-like sequence search software LAST (Kiełbasa et al., 2011) and plots the fitted model of the CRBH E-value based algorithm. In addition, users can filter the hit pairs prior to CRBH fitting for other criteria like query coverage, protein identity and/or the twilight zone of protein sequence alignments according to Rost (1999). The implemented filter uses equation 2 (see Rost, 1999): where x hit pair is the expected protein identity given the alignment length L hit pair . If the actual protein identity of a hit pair exceeds the expected protein identity (pident hit pair ≥ f (x hit pair )), it is retained for subsequent CRBH calculation.
In contrast to previous implementations, CRBHits only take coding nucleotide sequences (CDS) as the query and target inputs. This is due to the downstream functionality of CRBHits to directly calculate codon alignments within R, which rely on CDS. The inputs are translated into protein sequences, aligned globally (Smith & Waterman, 1981) and converted into codon alignments.
Functions are completely coded in R and only the external prerequisites (LAST, KaKs_Calculator2.0 and DAGchainer) need to be compiled. However, all of them are forked within CRBHits and can be easily build with the dedicated R functions make.last(), make.KaKs_Calculator2() and make.dagchainer(). Further, users can create their own RBH filters before CRBH calculation.

Functions and Examples
The following example shows how to obtain CRBHit pairs between the coding sequences of Schizosaccharomyces pombe (fission yeast) (Wood et al., 2012) and Nematostella vectensis (starlet sea anemone) (Apweiler et al., 2004) by using two URLs as input strings and multiple threads for calculation.

#get help ?rbh2kaks
Given the annotated chromosomal gene positions it is also possible to assign tandem duplicated genes per chromosome and directly compute chains of syntenic genes via the use of the R external tool DAGchainer (Haas et al., 2004). Here, Arabidopsis thaliana is compared to itself (so called selfblast) and syntenic groups vsiualized by their Ks values.

Availability
CRBHits is an open source software made available under the MIT license. It can be installed from its gitlab repository using the devtools package.