dtrackr: An R package for tracking the provenance of data

An accurate statement of the provenance of data is essential in biomedical research. Powerful data manipulation tools

dtrackr is first and foremost a utility to accelerate and improve research by facilitating documentation, supporting the extraction of knowledge from data sets, and supporting the execution of research by helping to identify data quality issues. The general capability, however, fits into a broader context of other provenance and data pipeline research. This includes initiatives such as C2Metadata (Alter et al., 2021), which focuses on a language-independent representation of a data pipeline; R packages such as targets (Landau, 2021), which focuses on documenting pipeline code and managing the execution of a pipeline; and RDataTracker, which focuses on tracking the execution of an arbitrary R script (Lerner et al., 2018). dtrackr takes a more data-oriented approach, which could be complementary: we remain agnostic to the detail of a data pipeline script and the nature of its execution, but capture a subset of the transformations applied to data alongside the data itself, thereby documenting the data state as it is being manipulated. This is achieved by overriding the execution of dplyr pipeline functions, and results in a retrospective record of provenance (Pimentel et al., 2019). dtrackr can also insert secondary analysis as annotations into the pipeline, and allows control over what information is collected, ultimately with a view to producing simple human-readable output. The approach of dtrackr is analogous to a git commit history for dataframes, and there is potential synergy with emerging versioned databases such as dolt (Dolt Is Git for Data!, 2019/2022; Ross, 2022).
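The idea of capturing transformations alongside the data can be illustrated with a minimal sketch. This is not dtrackr's implementation: the helpers start_history() and tracked_filter() and the "history" attribute are hypothetical names introduced only for illustration, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical helper (not part of dtrackr): attach an initial history
# record to a data frame as an attribute.
start_history <- function(df) {
  attr(df, "history") <- sprintf("start: %d rows", nrow(df))
  df
}

# Hypothetical wrapper around dplyr::filter() that appends a description
# of the step, including the number of rows excluded, to the history.
tracked_filter <- function(df, ...) {
  history <- attr(df, "history")
  out <- dplyr::filter(df, ...)
  step <- sprintf(
    "filter: kept %d rows, excluded %d rows",
    nrow(out), nrow(df) - nrow(out)
  )
  attr(out, "history") <- c(history, step)
  out
}

# Toy data, invented for this example.
visits <- data.frame(
  id  = 1:6,
  age = c(12, 25, 40, 17, 63, 80),
  arm = c("a", "a", "b", "b", "a", "b")
)

result <- visits %>%
  start_history() %>%
  tracked_filter(age >= 18) %>%
  tracked_filter(arm == "a")

# The record of what happened travels with the data frame itself.
print(attr(result, "history"))
```

The point of the sketch is only that the record is stored with the data rather than inferred from the script, so it stays accurate whatever route was taken to produce the data frame.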

Statement of need
The collection of experimental or observational data for research is often an iterative endeavour, involving curation of complex data sets designed for multiple goals. Systematic data quality checking for such sets is a major challenge, particularly when they are assembled to identify emerging or rapidly evolving issues. Feedback from early data analysis can identify specific data quality issues, resolution of which can considerably improve data for the task at hand. However, this requires a clear understanding of why and when individual data items are excluded, which is potentially tedious and may be seen as lower priority compared to statistical analysis. Data analysis using tidyverse in R is a rapid means of transforming raw data into a format suitable for statistical analysis. The transformations involved can, however, affect the results of statistical analysis, and meticulous care must be taken to ensure that any assumptions made during data processing are well documented. It is all too easy to inadvertently exclude data when filtering on missing items, or when joining linked data sets with incomplete foreign key relationships.
In complex data analysis, the use of interactive programming environments such as Read-Eval-Print Loops (REPL) in R markdown documents, interim caching of results, or conditionally branching data pipelines can result in the current state of a processed data set becoming decoupled from the code that is designed to generate it.
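The two sources of inadvertent exclusion mentioned above can be reproduced with plain dplyr. The data frames below are invented for illustration; the sketch shows that neither filter() nor inner_join() reports the rows it drops unless they are counted explicitly.

```r
library(dplyr)

# Toy data, invented for this example.
patients <- data.frame(
  id  = 1:5,
  age = c(34, NA, 51, 67, 45)
)
outcomes <- data.frame(
  id      = c(1L, 3L, 5L, 6L),
  outcome = c("recovered", "died", "recovered", "recovered")
)

# filter() drops rows where the condition evaluates to NA, so the patient
# with a missing age disappears without any warning.
adults <- patients %>% filter(age >= 18)

# inner_join() drops patients with no matching outcome record, which is
# what happens when the foreign key relationship is incomplete.
linked <- adults %>% inner_join(outcomes, by = "id")

# The losses only become visible if they are checked for explicitly.
dropped_by_filter <- nrow(patients) - nrow(adults)
dropped_by_join   <- anti_join(adults, outcomes, by = "id")

print(dropped_by_filter)
print(dropped_by_join)
```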
To surface these issues, biomedical journal articles are usually required to report data manipulation to an agreed standard. For example, CONSORT diagrams are part of the requirements for reporting parallel group clinical trials. They are described in the updated 2010 CONSORT statement (Schulz et al., 2010), and clarify how patients were recruited, selected, randomized and followed up. For observational studies, such as case control designs, an equivalent requirement is the STROBE statement (von Elm et al., 2008). There are many similar requirements for other types of study, such as the TRIPOD statement for multivariable prediction models (Collins et al., 2015). Maintaining such a diagram over the course of a study, when data sets are being actively collected and data quality issues are being addressed, is time-consuming.
dtrackr addresses these issues by instrumenting a commonly used subset of the standard tidyverse data manipulation pipeline functions from dplyr and tidyr. It can automatically record the steps taken, the records excluded, and a summary of the result of each data processing step, as part of the data set itself in a "history graph". In this way data sets retain an accurate history of their own provenance regardless of the actual route taken to assemble them. This history includes a complete record of any data quality issues that led to excluded records. The history is a directed graph which can be expressed in the commonly used GraphViz language (Gansner & North, 2000) and may be visualised as a flowchart such as in Figure 1; this uses the Chronic Granulomatous Disease dataset from the survival package (Therneau & Grambsch, 2000; Therneau, 2022) as an example of a parallel group study and produces a STROBE-like flowchart.
Figure 1: An example flowchart derived directly from a simple analysis of the Chronic Granulomatous Disease dataset demonstrating use of dtrackr to generate the key parts of a STROBE or CONSORT diagram.
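A pipeline of this kind might look like the sketch below. It is not the analysis behind Figure 1: the exclusion criterion (age under 5) and the message wording are assumptions chosen for illustration. It assumes the dplyr, dtrackr and survival packages are installed, and uses the dtrackr functions track(), comment(), exclude_all(), history() and flowchart() as we understand them from the package documentation.

```r
library(dplyr)
library(dtrackr)

tracked <- survival::cgd %>%
  # Start recording the history graph on this data frame.
  track() %>%
  # Annotate the pipeline; {.count} is filled in with the current row count.
  comment("CGD data: {.count} rows") %>%
  # Illustrative exclusion only (not taken from the paper); the number of
  # excluded rows is recorded in the history via {.excluded}.
  exclude_all(age < 5 ~ "{.excluded} rows with age under 5") %>%
  # Subsequent steps are recorded per treatment group.
  group_by(treat) %>%
  comment("{.count} rows in arm {treat}")

# The history graph is stored with the data frame and can be inspected.
print(history(tracked))

# Rendering is skipped in non-interactive sessions in this sketch.
if (interactive()) {
  flowchart(tracked)
}
```

Because the history is carried by the data frame, the flowchart reflects the steps actually applied to the data, even if the pipeline was run piecemeal in an interactive session.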
dtrackr was originally conceptualized during an analysis I undertook of the severity of the Alpha variant of SARS-CoV-2 (Challen et al., 2021), and has since been used for other epidemiological studies, including an analysis of the incidence of hospitalization for acute lower respiratory tract disease in Bristol, and a comparative analysis of the severity of the SARS-CoV-2 Omicron variant versus the Delta variant against a range of hospital outcomes (Hyams, Challen, Marlow, et al., 2022).
Although the specific example presented here is in the biomedical domain, tracking the provenance of data is a much broader issue, and we anticipate there are many other applications for dtrackr.