LTRpred: de novo annotation of intact retrotransposons

Summary
Transposable elements (TEs) play a crucial role in altering the genomic landscape of all organisms and thereby strongly influence the genetic information passed on to succeeding generations (Sundaram & Wysocka, 2020). In the past, TEs were seen as selfish mobile elements populating host genomes to increase their chances of transgenerational transmission over long evolutionary time scales. This notion of selfish elements is slowly changing (Drost & Sanchez, 2019), and a new picture is emerging of a complex genetic landscape that benefits both host and TE, in which novel forms can arise through random shuffling of genetic material. For example, tomato fruit shape (Benoit et al., 2019), the adaptive cryptic coloration of moths that arose during the industrial revolution (Chuong, Elde, & Feschotte, 2017), and inner cell mass development in human embryonic stem cells (Chuong et al., 2017) were all shown to be driven by TE activity. Thus, the impact of these elements on morphological traits is evident and deserves renewed attention in the light of evolvability. However, TE sequences tend to degenerate over time, leaving fragmented copies in host genomes that are often regarded as junk DNA and that hamper the assembly and annotation of new genomes.
Nowadays, the de novo detection of transposable elements is performed by annotation tools specifically designed to capture any type of repeated sequence, TE family, or remnant DNA locus that can be associated with known transposable elements within a genome assembly. The main goal of such efforts is to retrieve the maximum number of loci that can be associated with known TEs. If successful, such an annotation can then be used to mask TE remnants in host genomes and thereby simplify genomics studies focusing on host genes. As a consequence, these tools do not automatically distinguish between complete, potentially active TEs and their mutated copies.
Here, we introduce the LTRpred pipeline, which allows the de novo annotation of functional and thus potentially mobile retrotransposons in any given genome assembly. Unlike other annotation tools, LTRpred focuses on retrieving structurally intact elements within genome sequences rather than characterizing all traces of historic TE activity.
Such functional annotation is most useful when trying to identify retrotransposons responsible for recent reshuffling of genetic material across the tree of life. Detecting and further characterizing these active retrotransposons offers the potential to harness them as mutagenesis agents by inducing transposition bursts in a controlled fashion to stimulate genomic reshaping processes towards novel traits.

LTRpred emerged as a valuable tool for diverse TE mobilization studies
LTRpred was successfully used in previous studies to annotate functional retrotransposons for various applications. Specifically, LTRpred was used to annotate the retrotransposon family RIDER within the plant kingdom, a family shown to be involved in tomato fruit shape elongation. Together with experimental evidence, our analyses revealed that RIDER elements can be activated by drought stress and may help plants rich in RIDER activity to better adapt to drought stress conditions. In a complementary study, LTRpred was used to generate a candidate list of potentially mobile retrotransposon families in rice and tomato, which were then confirmed to produce extrachromosomal DNA using the ALE-Seq methodology (Cho et al., 2019). Finally, LTRpred supported efforts to annotate and date functional retrotransposons in tomato and Arabidopsis, which led to the finding that chromodomain DNA methyltransferases (CMTs) silence young and intact retrotransposons in distal chromatin, whereas older non-functional retrotransposons are affected by small RNA-directed DNA methylation (Wang & Baulcombe, 2020).
Together, potentially functional retrotransposons annotated de novo with LTRpred were subsequently shown to be active and mobile in diverse molecular studies. This approach may stimulate a new wave of research towards understanding the physiological role of functional retrotransposons and revealing the mechanistic principles of transposon-associated evolvability.
In detail, the LTRpred pipeline calls the command line tools suffixerator, LTRharvest (Ellinghaus, Kurtz, & Willhoeft, 2008), and LTRdigest (Steinbiss, Willhoeft, Gremme, & Kurtz, 2009), which are part of the GenomeTools library (Gremme, Steinbiss, & Kurtz, 2013). Using customized parameter settings, these tools screen for repeated LTRs, for specific sequence motifs such as primer binding sites (PBS), polypurine tracts (PPT), and target site duplications (TSD), and for conserved protein domains such as reverse transcriptase, gag, the integrase DNA-binding domain, the integrase zinc-binding domain, RNase H, and the integrase core domain. The LTRharvest and LTRdigest outputs are parsed by LTRpred and transformed into a tidy data format (Wickham et al., 2019), which subsequently enables automated curation of false positives. Next, open reading frame (ORF) prediction is performed by a customized wrapper function that runs the command line tool usearch (Edgar, 2010). This step makes it possible to automatically filter out retrotransposons that have conserved protein domains such as an integrase or reverse transcriptase but lack any ORF and thus might not be expressed. In a third step, retrotransposon family clustering is performed using sequence clustering with vsearch (Rognes, Flouri, Nichols, Quince, & Mahé, 2016), which defines family members by >90% sequence identity of the full elements to each other. In a fourth step, an automated hmmer search (Finn, Clements, & Eddy, 2011) against the Dfam database (https://dfam.org/home; Hubley et al., 2016) is performed to assign super-family associations such as Copia or Gypsy by comparing the protein domains of de novo predicted retrotransposons with TEs already annotated in Dfam.
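The ORF-based filtering step described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and does not use the LTRpred R API; the record type and the field names `n_domains` and `n_orfs` are hypothetical and are not taken from the LTRpred output.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One predicted retrotransposon (all field names are hypothetical)."""
    element_id: str
    n_domains: int  # number of conserved protein domain hits
    n_orfs: int     # number of predicted open reading frames


def keep_potentially_expressed(elements):
    """Keep only elements that carry at least one predicted ORF."""
    return [e for e in elements if e.n_orfs > 0]


candidates = [
    Element("elem_1", n_domains=3, n_orfs=2),
    Element("elem_2", n_domains=2, n_orfs=0),  # domains found, but no ORF
    Element("elem_3", n_domains=0, n_orfs=1),
]

retained = keep_potentially_expressed(candidates)
print([e.element_id for e in retained])
```

In this toy data set, `elem_2` is dropped even though it has domain hits, which mirrors the case the filtering step is meant to catch.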
In the last step, the de novo annotated 5′ and 3′ LTR sequences are used to estimate the evolutionary age of the retrotransposon. This estimate should be treated with caution, since retrotransposons can undergo reverse transcriptase-mediated recombination (Sanchez, Gaubert, Drost, Zabet, & Paszkowski, 2017).
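To make the idea behind LTR-based dating concrete, the following sketch shows one commonly used approach, which is not necessarily the model implemented in LTRpred. Because both LTRs of an element are identical at the time of insertion, their divergence K can be converted into an age T = K / (2r), where r is the substitution rate per site per year. The sketch assumes pre-aligned LTRs of equal length, applies the Jukes-Cantor correction, and uses a purely hypothetical rate of 1e-8; all function names are illustrative.

```python
import math


def ltr_divergence(ltr5: str, ltr3: str) -> float:
    """Proportion of mismatched sites between two pre-aligned LTRs.

    Alignment columns containing a gap ('-') in either sequence are skipped.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(ltr5.upper(), ltr3.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no comparable sites between the two LTRs")
    mismatches = sum(a != b for a, b in pairs)
    return mismatches / len(pairs)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed mismatch proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_age_years(ltr5: str, ltr3: str, rate: float) -> float:
    """Estimate insertion age as T = K / (2 * rate)."""
    k = jukes_cantor_distance(ltr_divergence(ltr5, ltr3))
    return k / (2.0 * rate)


# Toy example: 100 bp LTRs that differ at two sites.
ltr_a = "ACGT" * 25
ltr_b = "TCGA" + "ACGT" * 24
ASSUMED_RATE = 1e-8  # hypothetical substitutions per site per year

age = insertion_age_years(ltr_a, ltr_b, ASSUMED_RATE)
print(f"estimated age: {age:.0f} years")
```

Recombination between elements can change LTR divergence independently of the time since insertion, which is why an estimate of this kind is only a rough guide.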

Example workflow
After installing all prerequisite command line tools (https://hajkd.github.io/LTRpred/articles/Introduction.html#installation), users can run the LTRpred() pipeline using the default parameter configuration. In the following example, an LTR transposon prediction is performed for parts of the human Y chromosome.

LTRpred output
The LTRpred() function internally generates a folder named *_ltrpred which stores all output annotation and sequence files.
In detail, the following files and folders are generated by the LTRpred() function:
• Folder *_ltrpred
  - *_ORF_prediction_nt.fsa : Stores the predicted open reading frames within the predicted LTR transposons as DNA sequence.
  - *_ORF_prediction_aa.fsa : Stores the predicted open reading frames within the predicted LTR transposons as protein sequence.
  - *_LTRpred_DataSheet.tsv : Stores the output table as data sheet.
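For downstream analysis outside R, the data sheet can be located and read like any other tab-separated file. The sketch below is a Python illustration that relies only on the naming pattern listed above (`*_ltrpred` and `*_LTRpred_DataSheet.tsv`); the column names `element_id`, `start`, and `end` are hypothetical placeholders, and the mock file is created only so that the example runs without LTRpred installed.

```python
import csv
import tempfile
from pathlib import Path


def find_datasheets(results_dir: Path):
    """Return all data sheets found in *_ltrpred folders below results_dir."""
    return sorted(results_dir.glob("*_ltrpred/*_LTRpred_DataSheet.tsv"))


def read_datasheet(path: Path):
    """Read a tab-separated data sheet into a list of row dictionaries."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Build a tiny mock output folder (hypothetical columns and values).
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "example_ltrpred"
    out_dir.mkdir()
    sheet = out_dir / "example_LTRpred_DataSheet.tsv"
    sheet.write_text(
        "element_id\tstart\tend\n"
        "elem_1\t100\t5100\n"
        "elem_2\t9000\t15000\n"
    )

    sheets = find_datasheets(Path(tmp))
    rows = read_datasheet(sheets[0])

# Element width in base pairs, computed from the hypothetical coordinates.
widths = {r["element_id"]: int(r["end"]) - int(r["start"]) + 1 for r in rows}
print(widths)
```

The same two helper functions can be pointed at a directory that holds several `*_ltrpred` folders to collect the data sheets of multiple runs.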

Visualising functional retrotransposons annotated with LTRpred
Finally, users can visualise the positioning of de novo annotated retrotransposons along the chromosomes. Here, we choose an example based on the yeast genome.

Metagenome scale annotations
LTRpred allows users to generate annotations not only for single genomes but for multiple genomes (metagenomes) using only one pipeline function named LTRpred.meta().
Users can download the biomartr package to automatically retrieve genome assembly files for the species of interest.