coder: An R package for code-based item classification and categorization


Medical coding and classifications
Registry-based research and the use of real-world evidence (RWE) and real-world data (RWD) have gained popularity in recent years (Sherman et al., 2016), both as an epidemiological research tool and as a means of monitoring post-market safety and adverse events in support of regulatory decisions. Data from administrative, clinical and medical registries are often coded using standardized classifications of diagnoses, procedures/interventions, medications/medical devices and health status/functioning. Codes and classifications are developed and maintained by several international bodies, such as the World Health Organization (WHO), SNOMED International, and the Nordic Medico-Statistical Committee (NOMESCO).

Challenges
Common classifications such as the International Classification of Diseases (ICD) or the Anatomical Therapeutic Chemical Classification System (ATC) comprise thousands of codes, which are hard to use and interpret in applied research. This is often solved by an abstraction layer that combines individual codes into broader categories, sometimes further simplified into a single index value based on a weighted sum of individual categories (Charlson et al., 1987; Elixhauser et al., 1998; Pratt et al., 2018; Quan et al., 2005; Sloan et al., 2003).
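The abstraction layer described above can be illustrated with a minimal, language-agnostic sketch, written here in Python. It is not the coder API, and the scheme is a toy one: the category names, the regular expressions and the weights are assumptions chosen for illustration and do not reproduce any published index.

```python
import re

# Toy classification scheme: category -> (regular expression, weight).
# Patterns and weights are illustrative assumptions, not a published index.
SCHEME = {
    "myocardial infarction": (re.compile(r"^I2[12]"), 1),
    "heart failure": (re.compile(r"^I50"), 1),
    "diabetes": (re.compile(r"^E1[0-4]"), 2),
}


def categorize(codes):
    """Return the categories matched by at least one of the given codes."""
    return {
        name
        for name, (pattern, _weight) in SCHEME.items()
        if any(pattern.match(code) for code in codes)
    }


def weighted_index(categories):
    """Collapse a set of categories into a single weighted sum."""
    return sum(SCHEME[name][1] for name in categories)


# One item (e.g. a patient) with four recorded codes.
patient_codes = ["I210", "E119", "E109", "M161"]
categories = categorize(patient_codes)
score = weighted_index(categories)
```

Each category counts once no matter how many codes match it, so the two diabetes codes in the example contribute a single weight to the index.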

Statement of Need
Large and long-standing national databases often contain millions of entries and span several gigabytes (GB). This leads to a high computational burden and a time-consuming data management process, a cumbersome but necessary prerequisite before any relevant analysis can be performed. There are several R packages with a deliberate focus on comorbidity data coded by ICD and summarized by the Charlson or Elixhauser comorbidity indices (icd, comorbidity (Gasparini, 2018) and medicalrisk). The coder package includes such capabilities as well, but takes a more general approach to deterministic item classification and categorization.

The coder package
coder is an R package designed to combine items (e.g. patients) with generic code sets, and to classify and categorize such data using generic classification schemes defined by regular expressions. It is easy to combine different classifications (such as multiple versions of ICD, ATC or NOMESCO codes) with different classification schemes (such as Charlson, Elixhauser, RxRisk V or, for example, local definitions of adverse events after total hip arthroplasty) and with different weighted indices based on those classifications. The package includes default classification schemes for all those settings, as well as an infrastructure to implement and visualize custom classification schemes. Additional functions simplify the identification of codes and events within limited time frames, such as comorbidity during the year before surgery or adverse events within 30 days after it. coder can also be used in tandem with decoder, a package that facilitates the interpretation of individual codes.
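The time-frame restriction mentioned above can be sketched as follows. This is a conceptual illustration in Python, not the interface of coder; the function name `codes_in_window`, its parameters and the example events are hypothetical.

```python
from datetime import date, timedelta


def codes_in_window(events, index_date, days_before=0, days_after=0):
    """Keep codes dated within [index_date - days_before, index_date + days_after].

    `events` is a list of (code, date) pairs; both window ends are inclusive.
    """
    start = index_date - timedelta(days=days_before)
    end = index_date + timedelta(days=days_after)
    return [code for code, when in events if start <= when <= end]


surgery = date(2020, 6, 1)
events = [
    ("I210", date(2019, 5, 1)),    # more than one year before surgery
    ("E119", date(2019, 12, 24)),  # within one year before surgery
    ("T814", date(2020, 6, 20)),   # 19 days after surgery
    ("I269", date(2020, 8, 1)),    # 61 days after surgery
]

# Comorbidity: codes recorded during the year leading up to surgery.
comorbidity = codes_in_window(events, surgery, days_before=365)

# Adverse events: codes recorded within 30 days after surgery.
adverse = codes_in_window(events, surgery, days_after=30)
```

The filtered codes would then be passed on to a classification scheme, so the same scheme can serve both the comorbidity and the adverse-event use case. Whether the day of surgery itself belongs to a window is a study-specific decision; the sketch includes it in both.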
coder has been optimized for speed and large data sets using reference semantics from data.table, matrix-based computations and code profiling. Large data sets nevertheless make parallel computing difficult: the amount of available random-access memory (RAM) is often the more serious bottleneck, which limits the possibility of duplicating data sets across multiple cores.