dataquieR: assessment of data quality in epidemiological research

Summary
dataquieR is an R package for data quality assessments of data collections designed for research. It relies heavily on metadata that specify the requirements the study data must meet; this information can be collected in spreadsheet tables in a standardized manner. An assessment starts with integrity analyses, which check the formal compliance of the study data with expectations defined in the metadata, such as the data type. Depending on the available metadata, further assessments cover the dimensions completeness, consistency, and accuracy proposed in the framework of Schmidt et al. (2020). Three dataquieR functions investigate the completeness of data within and across observational units. Consistency-related analyses comprise two aspects. First, depending on the data type, data elements are checked against either user-defined limits or lists of expected values. Second, contradictions between the values of two data elements can be identified using one of eleven logical comparisons, e.g., a systolic blood pressure that is lower than the diastolic blood pressure although the opposite is expected. Eight dataquieR functions support accuracy-related analyses that target unexpected distributions of single or multiple data elements. Particular focus is placed on the influence of observers, examiners, and devices on the measurement process.
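As an illustration of the second consistency aspect, the following minimal Python sketch flags records in which systolic blood pressure is lower than diastolic blood pressure. It is a language-neutral illustration and not dataquieR code: the function name `flag_a_less_than_b`, the column names `sbp` and `dbp`, and the sample values are hypothetical, and only one of the logical comparisons mentioned above is shown.

```python
# Hypothetical study data: one dict per observational unit.
# None marks a missing value. Missingness is a completeness issue,
# so such records are skipped here rather than flagged.
STUDY_DATA = [
    {"id": 1, "sbp": 120.0, "dbp": 80.0},
    {"id": 2, "sbp": 70.0, "dbp": 95.0},   # contradiction: sbp < dbp
    {"id": 3, "sbp": None, "dbp": 85.0},   # not comparable
    {"id": 4, "sbp": 135.0, "dbp": 90.0},
]


def flag_a_less_than_b(records, a, b):
    """Return the ids of records in which the value of `a` is lower than that of `b`."""
    flagged = []
    for rec in records:
        value_a, value_b = rec.get(a), rec.get(b)
        if value_a is None or value_b is None:
            continue  # cannot compare when either value is missing
        if value_a < value_b:
            flagged.append(rec["id"])
    return flagged


contradictions = flag_a_less_than_b(STUDY_DATA, "sbp", "dbp")
print(contradictions)
```

With the sample data, only the record with id 2 is reported. Skipping records with missing values is a design choice of this sketch; it keeps completeness and consistency findings separate.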

Statement of Need
Various data quality concepts have been proposed to evaluate the "fitness for use" of data, with differing definitions of terms and focus areas (Cai & Zhu, 2015). To make the differences underlying these approaches comprehensible, Keller et al. (2017) stressed the importance of differentiating between (a) designed data collections, (b) administrative data, and (c) opportunity data. Kahn et al. (2016) had already proposed a concept of data quality tailored to electronic health record (EHR) data. Schmidt et al. (2020) recently introduced a framework that specifically addresses the requirements of designed research data collections. Data collected for research purposes differ substantially from EHR data because the researchers are involved in the design, conduct, and control of the measurement process. Furthermore, enriched metadata that describe the collected data elements beyond data types and labels are commonly available, as is process information, i.e., the circumstances under which the data have been generated (Richter et al., 2019). dataquieR was developed to make specific use of metadata and process information for data quality assessments in designed data collections, and to complement a data quality framework for research data collections.
Second, all exported functions of dataquieR may be applied individually to create customized reports. Besides potential modifications of the output, this approach allows for inclusion of transformed or new data elements created during the quality assessment.
Sample output of both approaches is shown in Figure 1. SummaryTable(s), returned as data frames, and ggplot2 (Wickham, 2016) objects (SummaryPlot, SummaryPlotList) are the most frequently used outputs of dataquieR. dataquieR adds to versatile R packages for assessing data quality, such as validate (Loo & Jonge, 2019), smartEDA (Putatunda et al., 2019), DataExplorer (Cui, 2019), and dataMaid (Petersen & Ekstrøm, 2019), by enabling R users to create extensive data quality reports.
The full functionality of dataquieR depends on the existence of well-defined metadata. In these metadata, one row corresponds to one data element of the study data (Richter et al., 2019); currently, up to 20 attributes can be used by dataquieR. Such attributes comprise, e.g., the data type, missing codes, different types of limits in interval notation (e.g., "[0; Inf)" for float-type data), value codes (e.g., "1=female | 2=male" for nominal data), distributional assumptions, and the keys to process variables describing the measurement process. While such information can be set up without programming knowledge, the effort required to create such metadata for large numbers of data elements is considerable. Yet appropriate metadata increase the FAIRness of research data (Wilkinson et al., 2018) and the transparency of research.
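To make the notation concrete, the following minimal Python sketch parses a limit in interval notation and a value-code string and applies both to a few values. It is an illustration under stated assumptions, not dataquieR's parser: the helper names `parse_interval`, `within`, and `parse_value_codes` are hypothetical. The sketch assumes that `;` separates the two bounds, that square brackets denote inclusive and round brackets exclusive bounds, that `|` separates `code=label` pairs, and that codes are integers, as the two examples above suggest.

```python
import re

# Assumed format: opening bracket, lower bound, ";", upper bound, closing bracket.
_INTERVAL = re.compile(r"^\s*([\[\(])\s*([^;]+?)\s*;\s*([^;]+?)\s*([\]\)])\s*$")


def parse_interval(spec):
    """Parse e.g. "[0; Inf)" into (lower, upper, lower_closed, upper_closed)."""
    match = _INTERVAL.match(spec)
    if match is None:
        raise ValueError(f"not a valid interval: {spec!r}")
    left, lower, upper, right = match.groups()
    # float() accepts "Inf" and "-Inf", so open-ended limits need no special case.
    return float(lower), float(upper), left == "[", right == "]"


def within(value, interval):
    """Return True if `value` lies inside the parsed interval."""
    lower, upper, lower_closed, upper_closed = interval
    above = value >= lower if lower_closed else value > lower
    below = value <= upper if upper_closed else value < upper
    return above and below


def parse_value_codes(spec):
    """Parse e.g. "1=female | 2=male" into {1: "female", 2: "male"}."""
    codes = {}
    for pair in spec.split("|"):
        code, label = pair.split("=", 1)
        codes[int(code.strip())] = label.strip()
    return codes


# Values outside the limits or outside the value list are reported.
limits = parse_interval("[0; Inf)")
print([v for v in (12.5, 0.0, -3.0) if not within(v, limits)])

sex_codes = parse_value_codes("1=female | 2=male")
print([v for v in (1, 2, 9) if v not in sex_codes])
```

In this sketch, the negative value violates the limit and the code 9 is not part of the value list, so these are the two entries that get reported.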
For further details on the concept and the metadata requirements, please visit the companion website.