containerit: Generating Dockerfiles for reproducible research with R

Linux containers have become a promising tool to increase transparency, portability, and reproducibility of research in several domains and use cases: data science (Boettiger, 2015), software engineering research (Cito & Gall, 2016), multi-step bioinformatics pipelines (Kim, Ali, Lijeron, Afgan, & Krampis, 2017), standardised environments for exchangeable software (Belmann et al., 2015), computational archaeology (Marwick, 2017), packaging algorithms (Hosny, Vera-Licona, Laubenbacher, & Favre, 2016), or geographic object-based image analysis (Knoth & Nüst, 2017). Running an analysis in a container increases reliability of a workflow, as it can execute packaged code independently of the author’s computer and its available configurations and dependencies. However, capturing a computational environment in containers can be complex, making container use difficult for domain scientists with limited programming experience. containerit opens up the advantages of containerisation to a much larger user base by assisting researchers, who are unfamiliar with Linux, command lines or containerisation, in packaging workflows based on R (R Core Team, 2018) in container images by using only user-friendly R commands.


Statement of Need
Recently, containerisation took off as a technology for packaging applications and their dependencies for fast, scalable, and secure sandboxed deployments in cloud-based infrastructures (cf. Osnat, 2018). The most widely used containerisation software is Docker, with the following core building blocks (cf. Docker: Get Started): The image is built from the instructions in a recipe called a Dockerfile. The image is executed as a container using a container runtime. An image can be moved between systems as a file (image tarball) or via an image registry. A Dockerfile may use the image created by another Dockerfile as the starting point, a so-called base image. While containers can be manually altered, the common practice is to conduct all configurations with the scripts and instructions originating in the Dockerfile.
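The relation between recipe, base image, and container can be illustrated with a minimal sketch in base R that writes a Dockerfile by hand. The base image tag, the script name analysis.R, the directory /payload, and the image name my-analysis are illustrative assumptions and do not come from the paper; the Docker commands in the comments are the standard CLI calls for building, running, and moving an image.

```r
# Minimal sketch: compose a Dockerfile (the recipe) as plain text.
# All concrete names below are hypothetical and chosen for illustration.
base_image <- "rocker/r-ver:3.5.2"  # assumed base image providing R

dockerfile_lines <- c(
  paste("FROM", base_image),               # start from the base image
  "WORKDIR /payload",                      # working directory inside the image
  "COPY analysis.R /payload/analysis.R",   # add the (hypothetical) workflow script
  'CMD ["Rscript", "analysis.R"]'          # default command when a container starts
)

# Write the recipe to a temporary directory and show it.
dockerfile_path <- file.path(tempdir(), "Dockerfile")
writeLines(dockerfile_lines, dockerfile_path)
cat(readLines(dockerfile_path), sep = "\n")

# With Docker installed, the building blocks map to these CLI calls:
#   docker build -t my-analysis <directory with Dockerfile>   # recipe -> image
#   docker run --rm my-analysis                               # image  -> container
#   docker save -o my-analysis.tar my-analysis                # image  -> tarball
#   docker load -i my-analysis.tar                            # tarball -> image
```

Writing such a file manually requires knowing both the Dockerfile syntax and the dependencies of the workflow, which is the step that is difficult for many domain scientists.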
An important advantage of containers over virtual machines is that their duality between recipe and image provides an additional layer of transparency and safeguarding. The Dockerfile and image can be published alongside a scientific paper to support peer review and, to some extent, preserve the original results (Nüst et al., 2017). Even if an image cannot be executed or a Dockerfile can no longer be built, the instructions in the Dockerfile are human-readable, and files in the image can be extracted to recreate an environment that closely resembles the original. Further useful features are (a) portability, thanks to a single runtime dependency, which allows readers to explore an author's virtual laboratory, including complex dependencies or custom-made code, either on their machines or in cloud-based infrastructures (e.g., by using Binder, see Project Jupyter et al., 2018), and (b) transparency, because an image's filesystem can be easily inspected. This way, containers can enable verification of reproducibility and auditing without requiring reviewers to manually download, install, and re-run analyses (Beaulieu-Jones & Greene, 2017).
Container preservation is an active field of research (Emsley & De Roure, 2018; Rechert et al., 2017). It is reasonable to assume that key stakeholders interested in workflow preservation, such as universities or scientific publishers, should be able to operate container runtimes on a time scale comparable to data storage requirements by funding agencies, e.g., 10 years in case of the German DFG or British EPSRC. To enable and leverage the stakeholders' infrastructure, container creation must become easier and more widespread.
The package containerit's main contribution is that it allows for automated capturing of runtime environments as Dockerfiles based on literate programming workflows (Gentleman & Lang, 2007) to support reproducible research. Together with stevedore (FitzJohn, 2019), containerit enables a completely R-based creation and manipulation of Docker containers. Using containerit only minimally affects researchers' workflows because it can be applied after completing a workflow, while at the same time the captured snapshots can enhance the scholarly publication process (in particular review, interaction, and preservation) and may form a basis for more reusable and transparent publications. In the future, containerit may support alternative container software such as Singularity (Kurtzer, Sochat, & Bauer, 2017), enable parametrisation of container executions and pipelines as demonstrated by Kliko (Molenaar, Makhathini, Girard, & Smirnov, 2018), or support proper accreditation of software (Jones et al., 2017; D. S. Katz & Chue Hong, 2018).
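As a hedged sketch of what this capturing step may look like from the user's side, the following snippet derives a Dockerfile from the current R session. It assumes that containerit is installed and that the functions dockerfile() and write() behave as described in the package documentation; the output location in a temporary directory is an illustrative choice. If the package is not available, the snippet does nothing.

```r
# Hedged sketch: capture the current session's environment as a Dockerfile.
# Assumes containerit is installed; the call is skipped otherwise.
generated <- NULL

if (requireNamespace("containerit", quietly = TRUE)) {
  # Build a Dockerfile object from the packages loaded in this session.
  generated <- containerit::dockerfile(from = utils::sessionInfo())

  # Show the generated instructions in the console.
  print(generated)

  # Store the recipe on disk (output path is an illustrative assumption).
  containerit::write(generated, file = file.path(tempdir(), "Dockerfile"))
} else {
  message("containerit is not installed; no Dockerfile was generated")
}
```

Because the input here is a session object, the snippet can be run at the end of an already completed analysis without changing the analysis code itself.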

Related Work
renv is an R package for managing reproducible environments for R, providing isolation, portability, and pinned versions of R packages, but it does not handle system dependencies. The Experiment Factory similarly focuses on ease of use for creating Dockerfiles for behavioural experiments, yet it uses a CLI-based interaction and generates extra shell scripts to be included in the images. ReproZip (Chirigati, Rampin, Shasha, & Freire, 2016) packages files identified by tracing in a self-contained bundle, which can be unpacked to a Docker container/Dockerfile. In the R domain, the package dockerfiler (Fay, 2018) provides an object-oriented API for manual Dockerfile creation, and liftr (Xiao, 2018) creates a Dockerfile based on fields added to the metadata header of an R Markdown document. automagic (Brokamp, 2017), Whales, dockter, and repo2docker use static program analysis to create environment descriptions from common project configuration files for multiple programming languages. Namely, automagic analyses R code and can store dependencies in a bespoke YAML format. Whales and dockter provide different formats, including Dockerfile. Finally, repo2docker primarily creates containers for interactive notebooks to run as a Binder (Project Jupyter et al., 2018) but does not actively expose a Dockerfile. None of them applies the strict code-execution approach that containerit does.