simstudy: Illuminating research methods through data generation

Summary
The simstudy package is a collection of functions for R (R Core Team, 2020) that allow users to generate simulated data sets in order to explore modeling techniques or better understand data generating processes. The user defines the distributions of individual variables, specifies relationships between covariates and outcomes, and generates data based on these specifications. The final data sets can represent randomized controlled trials, repeated-measures designs, cluster-randomized trials, or naturally observed data processes. Many other complexities can be added, including survival data, correlated data, factorial study designs, stepped-wedge designs, and missing data processes.
Simulation using simstudy has two fundamental steps. The user (1) defines the data elements of a data set and (2) generates the data based on these definitions. Additional functionality exists to simulate observed or randomized treatment assignments and exposures, to create longitudinal/panel data, to create multi-level/hierarchical data, to create data sets with correlated variables based on a specified covariance structure, to merge data sets, to create data sets with missing data, and to create non-linear relationships with underlying spline curves.
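The two steps can be sketched with the package's defData() and genData() functions. This is a minimal illustration rather than an example from the package documentation: the variable names (age, y) and all parameter values are assumptions chosen for the sketch.

```r
library(simstudy)

set.seed(2020)

# Step 1: define the data elements.
# "age" and "y" and their parameters are illustrative assumptions.
def <- defData(varname = "age", dist = "normal", formula = 50, variance = 100)
def <- defData(def, varname = "y", dist = "normal",
               formula = "2 + 0.1 * age", variance = 4)

# Step 2: generate 1000 observations from these definitions.
dd <- genData(1000, def)

# Fitting the matching linear model should give a slope near 0.1.
fit <- lm(y ~ age, data = dd)
coef(fit)
```

Because the outcome is defined through a formula on a covariate, the definition table reads much like the regression model that would later be fit to the data.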
The overarching philosophy of simstudy is to create data generating processes that mimic the typical models used to fit those types of data. As a result, the parameterization of some data generating processes may not follow the standard parameterization of the underlying distribution. For example, simstudy generates gamma-distributed data from a specified mean µ (or log(µ)) and a dispersion d, rather than from the shape α and rate β parameters that more typically characterize the gamma distribution. When we estimate the parameters, we are modeling µ (or some function of µ), so we should explicitly recover the simstudy parameters used to generate the data, which illuminates the relationship between the underlying data generating processes and the models. For more details on the package, use cases, examples, and a function reference, see the documentation page.
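The gamma example can be made concrete in base R, without simstudy itself. The sketch below assumes the convention that the variance equals d·µ², so that shape = 1/d and rate = 1/(µ·d). The helper mean_disp_to_shape_rate() is hypothetical and written only for this illustration. A gamma GLM with a log link should then recover roughly the log(µ) and d that were used to generate the data.

```r
set.seed(2020)

# Hypothetical helper (not part of simstudy's API): convert a mean and a
# dispersion into shape and rate, assuming variance = dispersion * mean^2.
mean_disp_to_shape_rate <- function(mu, d) {
  list(shape = 1 / d, rate = 1 / (mu * d))
}

mu <- exp(1.5)  # mean, specified so that log(mu) = 1.5
d  <- 0.4       # dispersion

pars <- mean_disp_to_shape_rate(mu, d)
y <- rgamma(20000, shape = pars$shape, rate = pars$rate)

# Fit the model typically used for such data: a gamma GLM with log link.
fit <- glm(y ~ 1, family = Gamma(link = "log"))

est_log_mu <- unname(coef(fit)[1])     # estimate of log(mu)
est_d      <- summary(fit)$dispersion  # estimate of the dispersion d
```

With this parameterization the fitted intercept and the estimated dispersion can be compared directly with the generating values, with no conversion from shape and rate.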
simstudy is available on CRAN and can be installed with install.packages("simstudy").

Statement of need
Empiricism and statistical analysis are cornerstones of scientific research, but they can lead us astray if used incorrectly. Choosing the right methodology for the hypothesis and expected data is crucial for useful, valid results. Data simulated with simstudy under assumptions derived from a hypothesis enable researchers to test and refine their analysis methodologies without time-intensive, expensive pre-tests or the collection of actual data. Additionally, data generated with simstudy can be used in generalized, theoretical simulation studies to further the field of methodology.
There are several R packages that allow for data generation under different assumptions. Most of these packages have a narrower scope that focuses on a specific class of data, like ICCbin (Hossain & Chakraborty, 2017), BinNonNor (Inan, Demirtas, & Gao, 2020) and genSurv (Meira-Machado & Faria, 2014). Some do not seem to be actively maintained (Alfons, Templ, & Filzmoser, 2010; Bien, 2016; Chan, 2014; Hofert & Mächler, 2016), which can cause compatibility issues. Some target specific fields of study and their needs, like the psychology-focused psych package (Revelle, 2020) or the conjurer package (Macherla, 2020), which provides methods to generate synthetic customer data for industry use. simstudy is unique in its philosophy of data generating processes that mimic the models used in analysis, and in allowing a wide range of complex data to be generated through these processes. The SimDesign package (Chalmers & Adkins, 2020) and the related MonteCarlo package (Leschinski, 2019) follow a similar line of thought but focus on easy replication of the analyses and on providing summaries of simulated data.