latentcor: An R Package for estimating latent correlations from mixed data types

We present `latentcor`, an R package for correlation estimation from data with mixed variable types. Mixed variable types, including continuous, binary, ordinal, zero-inflated, or truncated data, are routinely collected in many areas of science. Accurate estimation of correlations among such variables is often the first critical step in statistical analysis workflows. Pearson correlation, the default choice, is not well suited for mixed data types because the underlying normality assumption is violated. Semi-parametric latent Gaussian copula models, on the other hand, provide a unifying way to estimate correlations between mixed data types. The R package `latentcor` implements a comprehensive collection of these models, enabling the estimation of correlations between any combination of continuous, binary, ternary, and zero-inflated (truncated) variable types. The underlying implementation uses a fast multi-linear interpolation scheme with an efficient choice of interpolation grid points, giving the package a small memory footprint without compromising estimation accuracy. This makes latent correlation estimation readily available for modern high-throughput data analysis.


License
Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0).


Statement of need
No R software package is currently available that allows accurate and fast correlation estimation from mixed variable data in a unifying manner. The popular cor function within the R package stats (R Core Team 2013), for instance, computes Pearson's correlation, Kendall's τ, and Spearman's ρ, and a faster algorithm for calculating Kendall's τ is implemented in the R package pcaPP (Croux, Filzmoser, and Fritz 2013). Pearson's correlation is not appropriate for skewed or ordinal data, and its use leads to invalid inference in those cases. While the rank-based Kendall's τ and Spearman's ρ are more robust measures of association, the resulting values do not have a correlation interpretation and cannot be used as direct substitutes in statistical methods that require correlations as input (e.g., graphical model estimation (Yoon, Gaynanova, and Müller 2019)). The R package polycor (Fox 2019) is designed for ordinal data and computes polychoric (ordinal/ordinal) and polyserial (ordinal/continuous) correlations based on a latent Gaussian model. However, the package has no functionality for zero-inflated data, nor can it handle skewed continuous measurements, as it does not allow for copula transformations. The R package correlation (Makowski et al. 2020) in the easystats collection provides 16 different correlation measures, including polychoric and polyserial correlations. However, it lacks functionality for correlation estimation from zero-inflated data. The R package mixedCCA (Yoon, Carroll, and Gaynanova 2020) is based on the latent Gaussian copula model and can compute latent correlations between continuous/binary/zero-inflated variable types as an intermediate step for canonical correlation analysis. However, mixedCCA does not support ordinal data types.
The R package latentcor, introduced here, thus represents the first stand-alone R package for the computation of latent correlations that takes into account all variable types (continuous/binary/ordinal/zero-inflated), comes with an optimized memory footprint, and is computationally efficient, making latent correlation estimation almost as fast as rank-based correlation estimation.

Estimation of latent correlations
The general estimation workflow
The estimation of latent correlations consists of three steps:
• computing Kendall's τ between each pair of variables,
• choosing the bridge function F(·) based on the types of the variable pair; the bridge function connects the Kendall's τ computed from the data, τ̂, to the true underlying correlation ρ via the moment equation E(τ̂) = F(ρ),
• estimating the latent correlation by calculating F⁻¹(τ̂).
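The three steps can be sketched in base R for the simplest case, a pair of continuous variables. This is a minimal illustration rather than the latentcor implementation: it assumes the well-known continuous/continuous bridge function F(ρ) = (2/π) arcsin(ρ), whose inverse is available in closed form, and the helper names bridge_cc and bridge_cc_inverse are hypothetical. The other variable types use different bridge functions that generally have no closed-form inverse.

```r
# Synthetic latent Gaussian pair with known correlation (illustration only).
set.seed(1)
n <- 500
rho_true <- 0.6
z1 <- rnorm(n)
z2 <- rho_true * z1 + sqrt(1 - rho_true^2) * rnorm(n)

# Observed variables are monotone transformations of the latent ones
# (the "copula" part of the model); ranks, and hence Kendall's tau, are unchanged.
x1 <- exp(z1)
x2 <- z2^3

# Step 1: Kendall's tau computed from the observed data.
tau_hat <- cor(x1, x2, method = "kendall")

# Step 2: bridge function for a continuous/continuous pair (assumed form).
bridge_cc <- function(rho) 2 * asin(rho) / pi

# Step 3: invert the bridge function; here the inverse is closed-form.
bridge_cc_inverse <- function(tau) sin(pi * tau / 2)
rho_hat <- bridge_cc_inverse(tau_hat)

print(c(tau_hat = tau_hat, rho_hat = rho_hat, rho_true = rho_true))
```

Because Kendall's τ depends only on ranks, the estimate is unaffected by the monotone transformations applied to the latent variables.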
We summarize the references for the explicit form of F(·) for each variable combination as implemented in latentcor below.

Efficient inversion of the bridge function
In latentcor, the inversion of the bridge function F(·) can be computed in two ways. The original approach (method = "original") relies on numerical inversion for each pair of variables based on uni-root optimization (Yoon, Carroll, and Gaynanova 2020). Since each pair of variables requires a separate optimization run, the original approach is computationally expensive when the number of variables is large. The second approach inverts F(·) through fast multi-linear interpolation of pre-calculated F⁻¹ values at specific sets of interpolation grid points (method = "approx"). This construction was proposed by Yoon, Müller, and Gaynanova (2021) and is available for continuous/binary/truncated pairs in the current version of mixedCCA. However, that implementation lacks the ternary variable case and relies on an interpolation grid with a large memory footprint. latentcor includes the ternary case and provides an optimized interpolation grid by redefining the bridge functions on a rescaled version of Kendall's τ. Here, the scaling adapts to the smoothness of the underlying type of variables while keeping the approximation error at the same or a lower level. As a result, latentcor has a significantly smaller memory footprint (see Table below) and a smaller approximation error than mixedCCA.
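The contrast between the two inversion strategies can be sketched in base R. The snippet below is an illustration under simplifying assumptions, not the package's code: it uses the continuous/continuous bridge function F(ρ) = (2/π) arcsin(ρ) as a stand-in (chosen because its exact inverse, sin(πτ/2), allows the errors to be checked), a one-dimensional uniform grid instead of the rescaled multi-dimensional grid described above, and hypothetical helper names.

```r
# Stand-in bridge function (continuous/continuous case).
bridge_cc <- function(rho) 2 * asin(rho) / pi

# (a) Root-finding inversion: one numerical root search per value of tau,
#     in the spirit of method = "original".
invert_by_root <- function(tau) {
  uniroot(function(rho) bridge_cc(rho) - tau,
          interval = c(-1, 1), tol = 1e-10)$root
}

# (b) Interpolation-based inversion: pay the root-finding cost once on a
#     grid of tau values, then answer every later query by linear
#     interpolation, in the spirit of method = "approx".
tau_grid <- seq(-0.99, 0.99, by = 0.01)
inverse_grid <- vapply(tau_grid, invert_by_root, numeric(1))
invert_by_interp <- function(tau) {
  approx(tau_grid, inverse_grid, xout = tau)$y
}

# Compare both strategies against the exact inverse.
tau_new <- c(-0.457, 0.123, 0.3, 0.861)
rho_exact <- sin(pi * tau_new / 2)
rho_root <- vapply(tau_new, invert_by_root, numeric(1))
rho_interp <- invert_by_interp(tau_new)

max_err_root <- max(abs(rho_root - rho_exact))
max_err_interp <- max(abs(rho_interp - rho_exact))
print(c(root = max_err_root, interp = max_err_interp))
```

With p variables there are p(p − 1)/2 pairs, so replacing a root search per pair by a table lookup is what makes the interpolation approach attractive for large p; the price is the memory needed to store the grid, which is the quantity latentcor reduces.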

Illustrative example
To illustrate the performance of latent correlation estimation on mixed data, we consider the simple example of estimating correlations between continuous and ternary variables. In this synthetic scenario, we have access to the true underlying correlation between the variables. Figure 1A displays the values obtained by using standard Pearson correlation, revealing a significant estimation bias with respect to the true correlations. Figure 1B displays the estimated latent correlations using the original approach (method = "original") versus the true values of the underlying ternary/continuous correlations. The alignment of points around the y = x line confirms that the estimation is empirically unbiased. Figure 1C displays the estimated latent correlations using the approximation approach (method = "approx") versus the true values of the underlying latent correlations. The results are almost indistinguishable from Figure 1B at a fraction of the computational cost. The script to reproduce the displayed results is available at latentcor_evaluation.
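The flavor of this comparison can be reproduced with a small base R simulation. The sketch below is not the authors' evaluation script and does not use the ternary/continuous setting of Figure 1; as a simplifying assumption it uses a skewed continuous variable paired with a Gaussian one, together with the closed-form continuous/continuous inverse bridge sin(πτ/2). All names are hypothetical.

```r
set.seed(42)
n <- 1000
rho_grid <- seq(-0.8, 0.8, by = 0.4)  # true latent correlations

estimate_pair <- function(rho) {
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
  x1 <- exp(2 * z1)  # heavily skewed monotone transformation of z1
  x2 <- z2
  c(pearson = cor(x1, x2),
    latent  = sin(pi * cor(x1, x2, method = "kendall") / 2))
}

# One column per true correlation; rows are the two estimators.
estimates <- vapply(rho_grid, estimate_pair, numeric(2))

# Mean absolute deviation from the true latent correlation.
mae_pearson <- mean(abs(estimates["pearson", ] - rho_grid))
mae_latent  <- mean(abs(estimates["latent", ] - rho_grid))
print(rbind(rho_true = rho_grid, estimates))
print(c(mae_pearson = mae_pearson, mae_latent = mae_latent))
```

Plotting each row of estimates against rho_grid gives the analogue of the panels described above: the Pearson values are pulled toward zero, while the latent estimates stay close to the y = x line.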

Basic Usage
We provide two basic code examples of how to use latentcor in R.
The first example illustrates how to estimate latent correlation from pairs of ternary/continuous variables.
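A minimal sketch of such an example follows. It is not the authors' original listing; it assumes that the installed release of latentcor exports gen_data() (returning a list whose element X holds the simulated data) and latentcor() (returning a list whose element R holds the estimated correlation matrix), and it skips the computation when the package is not installed.

```r
# Runs only if the latentcor package is available (assumption: CRAN release API).
if (requireNamespace("latentcor", quietly = TRUE)) {
  set.seed(234820)

  # Simulate 100 observations of one ternary and one continuous variable.
  X <- latentcor::gen_data(n = 100, types = c("ter", "con"))$X

  # Latent correlation via numerical inversion of the bridge function.
  R_original <- latentcor::latentcor(X = X, types = c("ter", "con"),
                                     method = "original")$R

  # Latent correlation via multi-linear interpolation.
  R_approx <- latentcor::latentcor(X = X, types = c("ter", "con"),
                                   method = "approx")$R

  print(R_original)
  print(R_approx)
} else {
  R_original <- NULL
  R_approx <- NULL
  message("latentcor is not installed; example skipped.")
}
```

The types argument names the variable type of each column, here "ter" for ternary and "con" for continuous, and method switches between the two inversion strategies described earlier.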

Availability
The R package latentcor is available on GitHub. A comprehensive vignette with additional mathematical and computational details is available here.