BayesMFSurv: An R Package to Estimate Bayesian Split-Population Survival Models With (and Without) Misclassified Failure Events

Summary
Social scientists and biostatisticians often employ conventional parametric survival or mixture cure models (e.g., Weibull, Exponential) to analyze outcome variables in survival data that focus on the time until an event occurred or "failed" (Box-Steffensmeier & Zorn, 1999; Lee, Chakraborty, & Sun, 2017; Maller & Zhou, 1996). An important assumption underlying these models is that researchers accurately record the date (year, month, or day) on which an event or observation of interest failed (i.e., "terminated"). Yet events that are recorded as having failed at a given point in time can be inaccurately measured (Bagozzi, Joo, Kim, & Mukherjee, 2019; Clark, Bradburn, Love, & Altman, 2003; Schober & Vetter, 2018). Inaccurate measurement of this sort produces a subset of misclassified failure cases in survival data: subjects that are recorded as having failed or experienced the event of interest even though they actually "live on" past their recorded failure point.
There are several scenarios in which a subset of recorded failure events may persist beyond their recorded failure time, leading to misclassification in event failures. For example, political scientists who analyze the duration of civil wars fought between rebel groups and governments often record end dates ("failures") for specific conflicts based upon 24-month spells with fewer than 25 battle-deaths per year (Balch-Lindsay & Enterline, 2000; Thyne, 2012). This threshold is prone to measurement error, especially for lower-intensity civil wars in poor information environments that persist beyond their recorded end date. Other examples include the study of the duration of ancient civilizations (Cioffi-Revilla & Landman, 1999) and the time taken to detect cancer (Schober & Vetter, 2018). In both of these latter examples, researchers typically lack data on the precise time point of a given failure, either because of the passage of time or because of inaccurate information. As in the civil conflict example, this leads them to underestimate the duration of the misclassified failure cases.
Since these underestimates of duration are non-random, bias will arise in survival estimates of the phenomena mentioned above when using conventional survival models. Hence, the main motivation for developing the Bayesian Misclassified Failure (MF hereafter) split-population survival model is to resolve the methodological challenges that result from misclassified event failures by accounting for the possibility that some failure events survive beyond their recorded failure time. The development of the Bayesian MF model is also driven by the fact that it permits researchers to identify when the end date of observations in survival data is misclassified, thereby providing substantive insights into this process. Further, there is no R package that extracts the posterior distribution of estimates from parametric cure (split-population) models, including the MF model, using Bayesian Markov Chain Monte Carlo (MCMC) methods. BayesMFSurv (Joo, Bejar, Schmidt, & Mukherjee, 2019) is an R package (R Core Team, 2019) that contains functions and computationally intensive routines in C++ to fit the parametric Weibull and Exponential (i) survival model and (ii) Misclassified Failure survival model via Bayesian MCMC methods using slice-sampling (Bagozzi et al., 2019; Neal, 2003).

Motivation, Description, Applications
Numerous R packages offer functionalities to estimate conventional parametric and semi-parametric survival models via maximum likelihood estimation (MLE) or Bayesian MCMC methods (Diez, 2013; Therneau, 2019; Wang, Chen, Wang, & Yan, 2019; Zhou, Hanson, & Zhang, 2020). Other R packages focus on estimation of parametric or semi-parametric cure survival models using MLE (Amdahl, 2019; Beger, Hill, Metternich, Minhas, & Ward, 2017; Cai, Zou, Peng, & Zhang, 2012; Han, Zhang, & Shao, 2017). To our knowledge, there is no R package that fits parametric mixture cure models, including the MF survival model, via Bayesian MCMC (e.g., slice-sampling) methods, which offer a powerful yet flexible tool for estimating such models. Further, existing R packages that use Bayesian inference for survival analyses focus only on standard survival models that do not take into account latent misclassified failure events in survival data. Because misclassified failure events are in fact right-censored events, there is a non-zero probability that these misclassified cases persisted beyond their recorded failure time. Researchers who estimate standard survival or cure models without accounting for misclassified failure events will therefore underestimate the duration of these events.
Since the underestimates of duration are non-random, bias will arise in survival estimates of these phenomena when researchers use standard survival or cure models. To address this misclassified failure challenge in survival data, our BayesMFSurv R package incorporates the functions listed below, which fit the parametric MF survival model of Bagozzi et al. (2019) via Bayesian MCMC methods. This model estimates a system of two equations to account for the possibility that some unknown subset of failure events actually "lived on" beyond their recorded failure time. The first is a "splitting" equation that estimates the probability of a case being a misclassified failure, with or without covariates. The second equation is a standard parametric survival model, whose relevant failure and survival probabilities are estimated conditional on a case not being a misclassified failure. These features of the model in BayesMFSurv account for a heterogeneous mixture of failure cases in survival data and address the non-random underestimates of duration for misclassified failure events. BayesMFSurv also accommodates time-varying covariates, which are common in panel survival datasets. The model can be applied to at least the following kinds of survival data in which misclassified failure cases are prevalent: civil war termination, which determines civil war duration (Thyne, 2012); time taken to detect the onset of cancer (Schober & Vetter, 2018); and the collapse (and thus duration) of ancient civilizations or political regimes (Cioffi-Revilla & Landman, 1999; Reenock, Bernhard, & Sobek, 2007).
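The consequence of misclassification for observed durations can be illustrated with a small simulation. The following is a minimal sketch in base R and is not code from the package. The logistic form of the splitting probability, all coefficient values, and the rule that a misclassified case is recorded at a random fraction of its true duration are illustrative assumptions.

```r
set.seed(42)
n <- 2000

# Hypothetical covariates: x enters the survival stage, z the splitting stage
x <- rnorm(n)
z <- rnorm(n)

# Hypothetical parameter values (assumptions, not estimates from any dataset)
beta  <- c(0.5, 0.3)   # survival-stage coefficients
gamma <- c(-1.0, 1.0)  # splitting-stage coefficients
rho   <- 1.5           # Weibull shape

# Survival stage: true durations follow a Weibull with scale exp(x'beta)
true_time <- rweibull(n, shape = rho, scale = exp(beta[1] + beta[2] * x))

# Splitting stage: probability that a recorded failure is misclassified
# (a logistic link is assumed here for illustration)
alpha <- plogis(gamma[1] + gamma[2] * z)
misclassified <- rbinom(n, size = 1, prob = alpha)

# Misclassified cases are recorded as failing before they actually end
recorded_time <- ifelse(misclassified == 1,
                        true_time * runif(n, 0.2, 0.9),
                        true_time)

cat("Share misclassified:   ", mean(misclassified), "\n")
cat("Mean true duration:    ", mean(true_time), "\n")
cat("Mean recorded duration:", mean(recorded_time), "\n")
```

Because the probability of misclassification depends on z, the shortfall in recorded durations is not random across cases, which is the source of bias that the MF model is meant to address.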

BayesMFSurv R Package
The R package BayesMFSurv contains four functions to fit the parametric (Weibull and Exponential) (i) standard survival model and (ii) MF survival model via Bayesian MCMC using a slice-sampling algorithm described in Bagozzi et al. (2019). Bayesian MCMC estimation is conducted by using the Multivariate Normal prior for these models' split and survival stage parameters, and the Gamma prior for the shape parameter. The functions in BayesMFSurv are:
• mfsurv: Fits a parametric MF model via Bayesian MCMC with slice-sampling to estimate the misclassification failure probability in the split (first) stage and hazard in the second (survival) stage. Slice-sampling, which is conducted by using the univariate slice sampler (Neal, 2003), is employed to draw the posterior sample of the model's split and survival stage parameters.
• mcmcsurv: Fits a standard parametric survival model via Bayesian MCMC with slice-sampling employed to draw the posterior sample of the model's survival stage parameters.
• stats: Calculates log-likelihood and deviance information criterion (DIC) for fitted model objects of class mfsurv and mcmcsurv.
• summary: Summarizes Bayesian MCMC output (the mean, standard deviation, standard error of the mean, and 95% credible interval of each parameter's posterior distribution) from the Bayesian standard and MF parametric survival models.
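For readers unfamiliar with the univariate slice sampler of Neal (2003), the following base R sketch shows one update with the stepping-out and shrinkage procedures. It is an illustration only, not the package's C++ routine: the function name slice_sample_1d is hypothetical, w and m follow Neal's notation for the initial interval width and the maximum number of stepping-out steps, and the Gamma(3, 2) target is an arbitrary example.

```r
# One univariate slice-sampling update (stepping-out + shrinkage).
# log_f: log of the (unnormalized) target density; x0: current value.
slice_sample_1d <- function(log_f, x0, w = 1, m = 10) {
  # Draw the vertical level that defines the slice, on the log scale
  log_y <- log_f(x0) - rexp(1)

  # Stepping out: place an interval of width w randomly around x0,
  # then expand it in steps of w, using at most m - 1 extra steps in total
  L <- x0 - runif(1) * w
  R <- L + w
  J <- floor(runif(1) * m)
  K <- (m - 1) - J
  while (J > 0 && log_f(L) > log_y) { L <- L - w; J <- J - 1 }
  while (K > 0 && log_f(R) > log_y) { R <- R + w; K <- K - 1 }

  # Shrinkage: sample uniformly from [L, R], shrinking toward x0 on rejection
  repeat {
    x1 <- runif(1, L, R)
    if (log_f(x1) > log_y) return(x1)
    if (x1 < x0) L <- x1 else R <- x1
  }
}

# Example target: Gamma(shape = 3, rate = 2), whose mean is 1.5
log_target <- function(x) dgamma(x, shape = 3, rate = 2, log = TRUE)

set.seed(123)
n_iter <- 5000
draws <- numeric(n_iter)
current <- 1
for (i in seq_len(n_iter)) {
  current <- slice_sample_1d(log_target, current, w = 1, m = 10)
  draws[i] <- current
}
cat("Posterior-sample mean:", mean(draws), "\n")
```

In a multi-parameter model such as those described above, an update of this kind would typically be applied to one parameter at a time, with the other parameters held at their current values.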
The ease and speed of estimating the Bayesian standard and MF parametric survival models in BayesMFSurv are improved by using C++ to perform computationally intensive routines (e.g., slice-sampling) before pulling the output into R. Users can also produce trace plots and kernel density plots for each parameter from mcmcsurv and mfsurv, which fit the Bayesian standard and MF parametric models, respectively. To illustrate the BayesMFSurv package's functionality, all the functions listed above have been tested on a survival dataset of democratic regime failure (Reenock et al., 2007) that is described and included in this package.
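The kinds of quantities reported for a fitted model (posterior mean, standard deviation, standard error of the mean, 95% credible interval, and DIC) can be computed from any vector of posterior draws. The sketch below uses a toy exponential survival model without censoring and a conjugate Gamma prior, so the draws come directly from rgamma rather than from the package. The DIC shown is the standard definition (mean deviance plus effective number of parameters); that the package computes it in exactly this way is an assumption, and the standard error here is the naive one that ignores autocorrelation.

```r
set.seed(7)

# Toy data: 200 fully observed durations from an exponential with rate 0.5
t_obs <- rexp(200, rate = 0.5)

# Gamma(a, b) prior on the rate gives a Gamma(a + n, b + sum(t)) posterior
a <- 1
b <- 1
draws <- rgamma(4000, shape = a + length(t_obs), rate = b + sum(t_obs))

# Posterior summary of the rate parameter
post_summary <- c(
  mean  = mean(draws),
  sd    = sd(draws),
  se    = sd(draws) / sqrt(length(draws)),  # naive SE of the mean
  lower = unname(quantile(draws, 0.025)),   # 95% credible interval
  upper = unname(quantile(draws, 0.975))
)
print(round(post_summary, 4))

# Deviance: -2 times the log-likelihood at a given rate
deviance_at <- function(lambda) -2 * sum(dexp(t_obs, rate = lambda, log = TRUE))

D_bar <- mean(vapply(draws, deviance_at, numeric(1)))  # posterior mean deviance
D_hat <- deviance_at(mean(draws))                      # deviance at posterior mean
p_D   <- D_bar - D_hat                                 # effective no. of parameters
DIC   <- D_bar + p_D
cat("pD:", p_D, " DIC:", DIC, "\n")
```

With draws in hand, a trace plot and a kernel density plot can be produced in base R with plot(draws, type = "l") and plot(density(draws)).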

Availability
BayesMFSurv is open-source software made available under the MIT license. It can be installed from its GitHub repository using the remotes package: remotes::install_github("Nicolas-Schmidt/BayesMFSurv").