datawizard: an R package for easy data preparation and statistical transformations

The {datawizard} package for the R programming language (R Core Team, 2021) provides a lightweight toolbox to assist in key steps involved in any data analysis workflow: (1) wrangling the raw data to get it in the needed form, (2) applying preprocessing steps and statistical transformations, and (3) computing statistical summaries of data properties and distributions. It can therefore be a valuable tool for R users and developers looking for a lightweight option for data preparation.


Statement of Need
The {datawizard} package is part of {easystats}, a collection of R packages designed to make statistical analysis easier (Lüdecke et al., 2019, 2020; Makowski et al., 2020). As this ecosystem follows a "0-external-hard-dependency" policy, a data manipulation package that relies only on base R needed to be created. In effect, {datawizard} provides a data processing backend for this entire ecosystem. In addition to its usefulness to the {easystats} ecosystem, it also provides an option for R users and package developers who wish to keep their (recursive) dependency weight to a minimum (for other options, see Dowle & Srinivasan, 2021; Eastwood, 2021).
Because {datawizard} is also meant to be used and adopted easily by a wide range of users, its workflow and syntax are designed to be similar to {tidyverse} (Wickham et al., 2019), a widely used ecosystem of R packages. Thus, users familiar with the {tidyverse} can easily translate their knowledge and make full use of {datawizard}.
In addition to being a lightweight solution to clean messy data, {datawizard} also provides helpers for the other important step of data analysis: applying statistical transformations to the cleaned data while setting up statistical models. This includes various types of data standardization, normalization, rank-transformation, and adjustment. These transformations, although widely used, are not currently collectively implemented in a package in the R ecosystem, so {datawizard} can help new R users in finding the transformation they need.
Lastly, {datawizard} also provides a toolbox to create detailed summaries of data properties and distributions (e.g., tables of descriptive statistics for each variable). This is a common step in data analysis, but it is not available in base R or many modeling packages, so its inclusion makes {datawizard} a one-stop shop for data preparation tasks.

* Brenton Wiernik is currently an independent researcher and Research Scientist at Meta, Demography and Survey Science. The current work was done in an independent capacity.

Features

Data Preparation
Raw data is rarely in a state in which it can be fed directly into a statistical model. It often needs to be modified in various ways: columns need to be renamed or reshaped, certain portions of the data need to be filtered out, data scattered across multiple tables needs to be joined, and so on.
{datawizard} provides various functions for cleaning and preparing data (see Table 1).

Table 1
Function          Operation
…                 to rename variables
data_to_long()    to convert data from wide to long
data_to_wide()    to convert data from long to wide
data_join()       to join two data frames
…                 …

We will look at one example function that converts data in wide format to tidy/long format:
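A minimal sketch of such a wide-to-long conversion is given here. It assumes a recent version of {datawizard} in which data_to_long() accepts the select, names_to, and values_to arguments (argument names may differ in older releases). The stocks data frame and its values are hypothetical and serve only as an illustration.

```r
library(datawizard)

# Hypothetical wide-format data: one row per day, one column per stock.
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:2,
  X = c(1.2, 0.8, 1.5),
  Y = c(3.1, 2.9, 3.4)
)

# Stack the X and Y columns: the former column names go into `stock`,
# the cell values go into `price`, and `time` is repeated as needed.
long <- data_to_long(
  stocks,
  select = c("X", "Y"),
  names_to = "stock",
  values_to = "price"
)

long
```

Each of the three days now contributes one row per stock, so the result has six rows instead of three, which is the layout most modeling and plotting functions expect.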

Statistical Transformations
Even after getting the raw data in the needed format, we may need to transform certain variables further to meet requirements imposed by a statistical test.
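As a rough illustration, the sketch below applies three of the transformations mentioned earlier (standardization, normalization, and rank transformation) to a single numeric vector. It assumes the {datawizard} functions standardize(), normalize(), and ranktransform() called with their default settings; the height_cm values are hypothetical.

```r
library(datawizard)

# Hypothetical measurements (in centimetres).
height_cm <- c(150, 160, 170, 180, 190)

# Standardization (z-scoring): subtract the mean, divide by the SD.
height_z <- standardize(height_cm)

# Normalization: rescale the values to the 0-1 range.
height_01 <- normalize(height_cm)

# Rank transformation: replace each value by its rank.
height_rank <- ranktransform(height_cm)

# Collect the results side by side; as.numeric() drops any extra
# attributes the transformation functions may attach to their output.
transformed <- data.frame(
  height_cm = height_cm,
  height_z = as.numeric(height_z),
  height_01 = as.numeric(height_01),
  height_rank = as.numeric(height_rank)
)

transformed
```

Which transformation is appropriate depends on the model or test at hand: standardization puts variables on a common scale, whereas rank transformation discards distances between values and keeps only their order.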

Summaries of Data Properties and Distributions
The workhorse function to get a comprehensive summary of data properties is describe_distribution(), which combines a set of indices (e.g., measures of centrality, dispersion, range, skewness, kurtosis, etc.) computed by other functions in