restez : Create and Query a Local Copy of GenBank in R

Downloading sequences and sequence information from GenBank (Benson et al., 2013) and related NCBI databases is often performed via the NCBI API, Entrez (J. Ostell, 2002). Entrez, however, limits the number of requests, so downloading large amounts of sequence data in this way can be inefficient. When a large number of Entrez calls is made, downloading may take days, weeks or even months and could result in a user’s IP address being blacklisted from the NCBI services due to server overload. Additionally, Entrez limits the number of entries that can be retrieved at once, requiring a user to develop code for querying in batches.
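To illustrate the batching burden described above, the following is a minimal sketch of how a user might split a long vector of IDs into chunks before calling Entrez. The helper name fetch_in_batches, the batch size of 100 and the half-second pause are assumptions made for illustration, not limits documented by NCBI. rentrez::entrez_fetch() is only used when no other fetch function is supplied.

```r
# Hypothetical helper: query a fetch function in fixed-size batches.
# batch_size and pause are illustrative assumptions, not official
# Entrez limits. fetch_fun must return a single character string.
fetch_in_batches <- function(ids, batch_size = 100, pause = 0.5,
                             fetch_fun = NULL) {
  if (is.null(fetch_fun)) {
    # Default: fetch FASTA records from Entrez via rentrez
    fetch_fun <- function(batch) {
      rentrez::entrez_fetch(db = 'nucleotide', id = batch, rettype = 'fasta')
    }
  }
  # Assign each ID to a batch number: 1, 1, ..., 2, 2, ...
  groups <- ceiling(seq_along(ids) / batch_size)
  batches <- split(ids, groups)
  results <- character(length(batches))
  for (i in seq_along(batches)) {
    results[i] <- fetch_fun(batches[[i]])
    # Pause between requests to stay under the request limit
    if (i < length(batches)) Sys.sleep(pause)
  }
  paste(results, collapse = '')
}
```

Even this small helper omits retries and error handling, which a script making thousands of calls would also need.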

The restez package (D.J. Bennett, 2018a) aims to make sequence retrieval more efficient by allowing a user to download the GenBank database, either in its entirety or in subsets, to their local machine and query this local database instead. This process is more time-efficient as GenBank downloads are made via NCBI's FTP server using compressed sequence files. With a good internet connection and a computer with currently standard capabilities, a database comprising 7 GB of sequence information (i.e. the total sequence data available for Rodentia as of 27 June 2018) can be generated in less than 10 minutes. (For an outline of the functions and structure of restez, see Figure 1.)

Rentrez integration
rentrez (Winter, 2017) is a popular R package for querying NCBI's databases via Entrez in R. To maximize the compatibility of restez, we implemented wrapper functions with the same names and arguments as the rentrez equivalents. Whenever a wrapper function is called, the local database copy is searched first. If IDs are missing in the local database, a secondary call to Entrez is made via the internet. This allows for easy employment of restez in scripts and packages that are already using rentrez. At a minimum, a user currently using rentrez will only need to create a local subset of the GenBank database, call restez instead of rentrez and ensure the restez database is connected.
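The local-first lookup described above can be sketched in a few lines. This is not the restez implementation: fetch_with_fallback, local_fetch and remote_fetch are hypothetical names, and the local database is represented by any function that returns a named character vector of the records it holds.

```r
# Hypothetical sketch of a local-first lookup with a remote fallback.
# local_fetch(ids)  -> named character vector of records found locally
# remote_fetch(ids) -> named character vector of records fetched online
fetch_with_fallback <- function(ids, local_fetch, remote_fetch) {
  found <- local_fetch(ids)
  missing <- setdiff(ids, names(found))
  if (length(missing) > 0) {
    # Only IDs absent from the local copy trigger an internet call
    found <- c(found, remote_fetch(missing))
  }
  # Return the records in the order they were requested
  found[ids[ids %in% names(found)]]
}
```

Because the caller sees the same inputs and outputs whichever source answered, a design like this lets existing rentrez-based code switch to a local database without restructuring.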

Examples

A small example
After a restez database has been set up, we can retrieve all the sequences returned by an rentrez::entrez_search() call with a single command.
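A minimal sketch of such a retrieval is shown below. It assumes that a restez database has already been created at 'restez_db' and that the search is small enough for its accessions to be requested in one call. The helper names parse_accessions and fetch_local_fasta, the default path and the retmax value are illustrative choices. Because restez looks sequences up by accession ID, the search results are first converted to accessions with rettype = 'acc'.

```r
# Hypothetical helper: turn Entrez's newline-separated accession list
# (e.g. "AB000001.1\nAB000002.1\n") into version-less accession IDs.
parse_accessions <- function(raw) {
  accs <- strsplit(x = raw, split = '\n', fixed = TRUE)[[1]]
  accs <- accs[nzchar(accs)]
  sub(pattern = '\\.[0-9]+$', replacement = '', x = accs)
}

# Hypothetical wrapper: search online, then fetch sequences locally.
# Nothing is run until the function is called.
fetch_local_fasta <- function(term, db_path = 'restez_db', retmax = 100) {
  restez::restez_path_set(filepath = db_path)
  restez::restez_connect()
  on.exit(restez::restez_disconnect())
  # Search Entrez for matching records
  search <- rentrez::entrez_search(db = 'nucleotide', term = term,
                                   retmax = retmax)
  # Convert the search hits to accession IDs
  raw <- rentrez::entrez_fetch(db = 'nucleotide', id = search$ids,
                               rettype = 'acc')
  # Same arguments as rentrez::entrez_fetch(), served from the local copy
  restez::entrez_fetch(db = 'nucleotide', id = parse_accessions(raw),
                       rettype = 'fasta')
}
```

The final call differs from its rentrez counterpart only in the package prefix, which is the substitution the text describes.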

A large example
phylotaR is an R package for retrieving and identifying orthologous sequence clusters from GenBank as a first step in a phylogenetic analysis (D. Bennett et al., 2018). Because the package runs an automated pipeline, multiple queries to GenBank via Entrez are made using the rentrez package. As a result, for large taxonomic groups containing well-sequenced organisms, the pipeline can take a long time to complete.

library(phylotaR)
# run the phylotaR pipeline for New World monkeys
wd <- 'nw_monkeys' # working directory
txid <- 9479 # taxonomic ID
setup(wd = wd, txid = txid)
run(wd = wd)
# ^ takes around 40 minutes

We can download and create a local copy of the primate sequences in GenBank and re-run the above code with a library call to restez for speed-up gains and increased code reliability.

# set up the database
library(restez)
# Set the path to a local directory in which the database will be stored
# Make sure you have sufficient disk space!
restez_path_set(filepath = 'restez_db')
db_download(db = 'nucleotide') # Interactively download GenBank data
db_create(db = 'nucleotide')

Now when re-running the first phylotaR code block with the inclusion of the restez package, the procedure completes approximately eight times faster.

# run phylotaR again
library(phylotaR)
library(restez)
restez_path_set(filepath = 'restez_db')
wd <- 'nw_monkeys' # working directory
txid <- 9479 # taxonomic ID
setup(wd = wd, txid = txid)
run(wd = wd)
# ^ takes around 5 minutes

For more detailed and up-to-date examples and tutorials, see the restez GitHub page (D.J. Bennett, 2018b).

Figure 1: The functions and file structure for downloading, setting up and querying a local copy of GenBank.