mixR: An R package for Finite Mixture Modeling for Both Raw and Binned Data


Statement of need
R (R Core Team, 2020) provides a rich collection of packages for building and analyzing finite mixture models, which are widely used in unsupervised learning tasks such as model-based clustering and density estimation. For example, mclust (Scrucca et al., 2016) builds Gaussian mixture models with different covariance structures; mixtools (Benaglia et al., 2010) implements parametric and non-parametric mixture models as well as mixtures of Gaussian regressions; flexmix (Leisch, 2004) provides a general framework for finite mixtures of regression models; and mixdist (Macdonald et al., 2018) fits mixture models to grouped and conditional data (also called binned data). To our knowledge, almost all R packages for finite mixture models, mixdist being the exception, are designed to take raw data as the modeling input. However, the popular model selection methods based on information criteria or the bootstrap likelihood ratio test (bLRT) (Feng & McCulloch, 1996; McLachlan, 1987; Yu & Harvill, 2019) are not implemented in mixdist. To bridge this gap and to unify the interface for finite mixture modeling of both raw and binned data, we implement the mixR package, which provides the following primary features.
• select() selects the best model from a series of mixture models with different numbers of mixture components using the Bayesian Information Criterion (BIC).
• bs.test() performs the bLRT for two mixture models from the same distribution family but with different numbers of components.
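To make the two selection methods concrete, the following base-R sketch implements BIC comparison and a parametric bootstrap LRT for univariate Gaussian mixtures. It is an illustration under stated assumptions, not mixR's implementation: fit_gmm(), bic() and blrt() are hypothetical helpers written for this example, the simulated data and the small number of bootstrap replicates are arbitrary, and BIC is assumed to follow the common convention -2 log L + k log n (smaller is better).

```r
# Hypothetical helper: EM for a univariate Gaussian mixture with g components.
fit_gmm <- function(x, g, max_iter = 500, tol = 1e-8) {
  n <- length(x)
  # Initialise by splitting the sorted data into g equal-sized groups.
  grp <- ceiling(g * rank(x, ties.method = "first") / n)
  w  <- as.numeric(table(grp)) / n
  mu <- as.numeric(tapply(x, grp, mean))
  s  <- as.numeric(tapply(x, grp, sd))
  loglik <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: n x g matrix of weighted component densities.
    dens <- matrix(sapply(seq_len(g), function(j) w[j] * dnorm(x, mu[j], s[j])),
                   nrow = n)
    total <- rowSums(dens)
    new_loglik <- sum(log(total))
    resp <- dens / total                 # posterior membership probabilities
    # M-step: weighted updates of proportions, means and standard deviations.
    nk <- colSums(resp)
    w  <- nk / n
    mu <- colSums(resp * x) / nk
    s  <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
    converged <- abs(new_loglik - loglik) < tol
    loglik <- new_loglik                 # log-likelihood at the last E-step
    if (converged) break
  }
  list(w = w, mu = mu, s = s, loglik = loglik)
}

# BIC with (g - 1) free proportions plus a mean and an sd per component.
bic <- function(loglik, g, n) -2 * loglik + (3 * g - 1) * log(n)

# Parametric bootstrap LRT of g0 versus g1 components (g0 < g1).
blrt <- function(x, g0, g1, B = 20) {
  n <- length(x)
  fit0 <- fit_gmm(x, g0)
  observed <- 2 * (fit_gmm(x, g1)$loglik - fit0$loglik)
  boot <- replicate(B, {
    # Simulate from the fitted null model, then refit both models.
    z  <- sample(seq_len(g0), n, replace = TRUE, prob = fit0$w)
    xb <- rnorm(n, fit0$mu[z], fit0$s[z])
    2 * (fit_gmm(xb, g1)$loglik - fit_gmm(xb, g0)$loglik)
  })
  list(statistic = observed, p_value = mean(boot >= observed))
}

set.seed(1)                                    # arbitrary seed
x <- c(rnorm(300, 0, 1), rnorm(200, 5, 1))     # simulated two-component data
bic_values <- sapply(1:3, function(g) bic(fit_gmm(x, g)$loglik, g, length(x)))
best_g <- which.min(bic_values)                # component count with lowest BIC
test_1_vs_2 <- blrt(x, g0 = 1, g1 = 2, B = 20)
```

In practice B would be much larger than 20; the small value only keeps the example fast. The bootstrap loop refits both models B times, which is why the bLRT is far more expensive than a BIC comparison.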
mixR also contains the following additional features.
• Functions to generate random data from mixture models.
• Functions to convert the parameters of Weibull and Gamma mixture models between the shape-scale representation used in the probability density functions and the mean-variance representation, which describes the distribution more intuitively.
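The conversions between the two representations follow from the standard moment formulas and can be sketched in base R as below. The function names are hypothetical and chosen for this example; they are not mixR's own converters. For the Gamma distribution the mapping is closed-form, while for the Weibull distribution the shape has to be found numerically because the coefficient of variation depends on the shape alone.

```r
# Gamma with shape a and scale b: mean = a * b, variance = a * b^2.
gamma_to_mean_sd <- function(shape, scale) {
  c(mean = shape * scale, sd = sqrt(shape) * scale)
}
mean_sd_to_gamma <- function(mean, sd) {
  c(shape = (mean / sd)^2, scale = sd^2 / mean)
}

# Weibull with shape k and scale l:
#   mean = l * Gamma(1 + 1/k)
#   variance = l^2 * (Gamma(1 + 2/k) - Gamma(1 + 1/k)^2)
weibull_to_mean_sd <- function(shape, scale) {
  m <- scale * gamma(1 + 1 / shape)
  v <- scale^2 * (gamma(1 + 2 / shape) - gamma(1 + 1 / shape)^2)
  c(mean = m, sd = sqrt(v))
}
mean_sd_to_weibull <- function(mean, sd) {
  # The squared coefficient of variation is a decreasing function of the shape,
  # so the shape is the root of cv2(k) - (sd / mean)^2.
  cv2 <- function(k) gamma(1 + 2 / k) / gamma(1 + 1 / k)^2 - 1
  k <- uniroot(function(k) cv2(k) - (sd / mean)^2,
               interval = c(0.1, 100), tol = 1e-10)$root
  c(shape = k, scale = mean / gamma(1 + 1 / k))
}

# Round trips with illustrative values.
g_par  <- mean_sd_to_gamma(mean = 2, sd = 0.5)
g_back <- gamma_to_mean_sd(g_par[["shape"]], g_par[["scale"]])
w_par  <- mean_sd_to_weibull(mean = 0.6, sd = 0.1)
w_back <- weibull_to_mean_sd(w_par[["shape"]], w_par[["scale"]])
```

The search interval for the Weibull shape (0.1 to 100) is an assumption that covers coefficients of variation from roughly 0.013 upward; more concentrated distributions would need a wider interval.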

Examples
We demonstrate how to use mixR for fitting finite mixture models and selecting mixture models using BIC and bLRT.

Model fitting
We fit the following four mixture models to a data set that consists of 1000 random data points generated from a Weibull mixture model with two components.
• Gaussian mixture with two components (mod1)
• Gaussian mixture with two components to the binned data (mod2)
• Gaussian mixture with three components (mod3)
• Weibull mixture with two components (mod4)
The fitted coefficients in mod1 and mod2 and the top two plots in Figure 1 show that binning does not cause much information loss, and we get similar fitted results using either raw data or binned data. This is usually the case when we have at least a moderate data size and the underlying mixture model is not too complex (e.g., it does not have too many mixture components). A benefit of binning is that it significantly reduces the computational burden for large data, especially when conducting the bLRT, which is computationally intensive. From Figure 1 we also observe that Gaussian mixture models can provide a good fit for non-Gaussian data, though the number of mixture components tends to be overestimated because more Gaussian components are needed to model the asymmetry and long tails that usually exist in non-Gaussian data.
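A setup of this kind can be sketched as follows. The data generation and binning use base R only; the mixture parameters, the seed, and the choice of 29 equal-width bins are illustrative assumptions rather than values taken from the text. The final block calls mixR only if it is installed, and the argument names used for mixfit() and bin() are assumptions that should be checked against the package documentation.

```r
set.seed(102)                      # arbitrary seed, for reproducibility only

# Illustrative two-component Weibull mixture (parameter values are assumptions).
n     <- 1000
w     <- c(0.4, 0.6)               # mixing proportions
shape <- c(6, 12)                  # Weibull shape per component
scale <- c(0.65, 1.35)             # Weibull scale per component

# Draw a component label for each point, then sample from that component.
z <- sample(1:2, n, replace = TRUE, prob = w)
x <- rweibull(n, shape = shape[z], scale = scale[z])

# Bin the raw data: one row per bin with its lower bound, upper bound and count.
brks   <- seq(min(x), max(x), length.out = 30)
counts <- hist(x, breaks = brks, plot = FALSE)$counts
binned <- cbind(lower = brks[-length(brks)], upper = brks[-1], count = counts)

# Fit the four models with mixR when the package is available.
if (requireNamespace("mixR", quietly = TRUE)) {
  mod1 <- mixR::mixfit(x, ncomp = 2, family = "normal")
  mod2 <- mixR::mixfit(mixR::bin(x, brks), ncomp = 2, family = "normal")
  mod3 <- mixR::mixfit(x, ncomp = 3, family = "normal")
  mod4 <- mixR::mixfit(x, ncomp = 2, family = "weibull")
}
```

The binned matrix has 29 rows instead of 1000 observations, which illustrates where the computational saving comes from: the likelihood is evaluated once per bin rather than once per data point.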