IDCeMPy: Python Package for Inflated Discrete Choice Models

Summary
Scholars and data scientists often use discrete choice models to evaluate ordered dependent variables using the ordered probit model and unordered polytomous outcome measures via the multinomial logit (MNL) estimator (Greene, 2002; Richards & Bonnet, 2018; Sarrias, 2016). These models, however, cannot account for the possibility that in many ordered and unordered polytomous choice outcomes, a disproportionate share of observations, stemming from two distinct data generating processes (DGPs), falls into a single category, which is thus "inflated." For instance, ordered outcome measures of self-reported smoking behavior that range from 0 for "no smoking" to 3 for "smoking 20 cigarettes or more daily" contain excessive observations in the zero (no smoking) category, which includes individuals who never smoke cigarettes and those who smoked previously but temporarily stopped smoking because of an increase in cigarette costs (Greene et al., 2015; Harris & Zhao, 2007). The "indifference" middle category in ordered measures of immigration attitudes is inflated since it includes respondents who are genuinely indifferent about immigration and those who select "indifference" for social desirability reasons (Bagozzi & Mukherjee, 2012; Brown et al., 2020). The baseline category of unordered polytomous variables of presidential vote choice is also often inflated, as it includes non-voters who abstain from voting owing to temporary factors and routine non-voters who are disengaged from the political process (Bagozzi & Marchetti, 2017; Campbell & Monson, 2008). Inflated discrete choice models have been developed to address such category inflation in ordered and unordered polytomous outcome variables, as failing to do so leads to model misspecification and incorrect inferences (Bagozzi & Mukherjee, 2012; Brown et al., 2020; Harris & Zhao, 2007).
IDCeMPy is an open-source Python package that enables researchers to fit three distinct sets of discrete choice models used by data scientists, economists, engineers, political scientists, and public health researchers: the Zero-Inflated Ordered Probit model without (ZiOP) and with (ZiOPC) correlated errors, the Middle-Inflated Ordered Probit model without (MiOP) and with (MiOPC) correlated errors, and Generalized-Inflated Multinomial Logit (GiMNL) models. Functions that fit the ZiOP(C) models in IDCeMPy evaluate zero-inflated ordered dependent variables that result from two DGPs, while functions that fit the MiOP(C) models account for inflated middle-category ordered outcomes that emerge from distinct DGPs. The functions in IDCeMPy that fit GiMNL models account for the large share and heterogeneous mixture of observations in the baseline and other lower outcome categories in unordered polytomous dependent variables. The functions that fit these models are described primarily on the IDCeMPy package's documentation website.

State of the Field
Software packages and code are available for estimating standard (non-inflated) discrete choice models. In the R environment, the MASS (Venables & Ripley, 2002) and micEcon (Henningsen, 2014) packages fit binary and discrete choice models. The Rchoice (Sarrias, 2016) package allows researchers to estimate binary and ordered probit and logit models as well as the Poisson model by employing various optimization routines. The proprietary LIMDEP package NLOGIT (Greene, 2002) fits conventional binary and ordered discrete choice models but is neither open source nor freely available. The R mlogit (Croissant, 2012) and mnlogit (Hasan et al., 2016) packages provide tools for working with conventional MNL models, while gmnl (Sarrias et al., 2017) and PReMiuM (Liverani et al., 2015) estimate MNL models that incorporate unit-specific heterogeneity. There are proprietary LIMDEP software and R code, but not an R package, that fit a few inflated ordered probit and MNL models (Bagozzi & Marchetti, 2017; Bagozzi & Mukherjee, 2012; Harris & Zhao, 2007). Outside R, the Python biogeme (Bierlaire, 2016) [...] The R or LIMDEP software, along with the Stata commands listed above, are undoubtedly helpful. However, to our knowledge, there are no R or Python packages that fit a variety of statistical models accounting for the excessive (i.e., "inflated") share of observations in the baseline and other higher categories of ordered and unordered polytomous dependent variables, which are commonly analyzed across the natural and social sciences. As discussed below, our Python package IDCeMPy thus fills an important lacuna by providing an array of functions that fit a substantial range of inflated discrete choice models applicable across various disciplines.

Statement of Need
Although our IDCeMPy package also fits standard discrete choice models, what makes it unique is that, unlike existing software, it offers functions to fit and assess the performance of both Zero-Inflated and Middle-Inflated Ordered Probit (OP) models without and with correlated errors, as well as a set of Generalized-Inflated MNL models. The models included in IDCeMPy account for the excessive proportion of observations in any given ordered or unordered outcome category by combining a single binary probit or logit split-stage equation with either an ordered probit outcome-stage equation (for the Zero- and Middle-Inflated OP models) or an MNL outcome-stage equation. Users can treat the error terms from the two equations in the Zero- and Middle-Inflated OP models as independent or correlated in the package's estimation routines. IDCeMPy also provides functions to assess each included model's goodness of fit via the AIC statistic, extract the covariates' marginal effects from each model, and conduct Vuong tests for comparing the performance of the standard and inflated discrete choice models.
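To make the two-stage structure concrete, the following is a minimal sketch of the category probabilities for a zero-inflated ordered probit with independent errors, in the spirit of Harris & Zhao (2007). It is not IDCeMPy's implementation: the function names (`norm_cdf`, `op_probs`, `ziop_probs`, `ziop_loglik`) and the inputs `zg` (split-stage linear predictor), `xb` (outcome-stage linear predictor), and `cuts` (ordered cutpoints) are hypothetical, chosen only for illustration, and the correlated-errors variant is not covered.

```python
import math


def norm_cdf(v):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def op_probs(xb, cuts):
    """Ordered probit probabilities for categories 0..len(cuts)."""
    # Cumulative probabilities at each cutpoint, padded with 0 and 1.
    cdf = [0.0] + [norm_cdf(c - xb) for c in cuts] + [1.0]
    return [cdf[j + 1] - cdf[j] for j in range(len(cuts) + 1)]


def ziop_probs(zg, xb, cuts):
    """Zero-inflated ordered probit probabilities (independent errors).

    The probit split stage gives the probability of entering the
    ordered outcome stage; observations that do not enter it are
    always recorded in category 0, which inflates that category.
    """
    p_outcome_stage = norm_cdf(zg)
    probs = [p_outcome_stage * p for p in op_probs(xb, cuts)]
    probs[0] += 1.0 - p_outcome_stage
    return probs


def ziop_loglik(y, zg, xb, cuts):
    """Log-likelihood summed over observations (y[i] is a category index)."""
    return sum(
        math.log(ziop_probs(g, b, cuts)[yi]) for yi, g, b in zip(y, zg, xb)
    )


if __name__ == "__main__":
    # Illustrative values only; these are not estimates from any dataset.
    print(ziop_probs(zg=0.0, xb=0.5, cuts=[0.0, 1.0, 2.0]))
```

In this sketch the probabilities always sum to one, and the zero category receives extra mass of one minus the split-stage probability. As the split-stage predictor grows large, the model collapses to the standard ordered probit.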
The functions in IDCeMPy use quasi-Newton optimization methods such as the Broyden-Fletcher-Goldfarb-Shanno algorithm for maximum likelihood estimation (MLE), which aids convergence and speeds estimation. Another feature is that the coefficients, standard errors, and confidence intervals obtained for each model estimated in IDCeMPy are in pandas.DataFrame (McKinney, 2010) format and are stored as the class attribute .coefs. This allows for easy export to CSV or Excel, which makes it easier for users to perform diagnostic tests and extract marginal effects. IDCeMPy is thus essential as it provides a much-needed unified software package to fit statistical models that account for category inflation in several ordered and unordered outcome variables used across fields as diverse as economics, engineering, marketing, political science, public health, sociology, and transportation research. Users can employ the wide range of statistical models in IDCeMPy to assess:
• Zero-inflation in self-reported smoking behavior (Harris & Zhao, 2007), demand for health treatment (Greene et al., 2015), and accident injury-severity (Fountas et al., 2018).

Functionality and Applications
IDCeMPy contains the functions listed below, which estimate via MLE the inflated discrete choice models described earlier:
• opmod; iopmod; iopcmod: Fit the ordered probit model, the Zero-Inflated (ZiOP) and Middle-Inflated (MiOP) ordered probit models without correlated errors, and the ZiOPC and MiOPC models that incorporate correlated errors.
• vuong_opiop; vuong_opiopc: Calculate the Vuong test statistic for comparing the performance of the OP model with the ZiOP(C) and MiOP(C) models.
• split_effects; ordered_effects: Estimate marginal effects of covariates in the split stage and the outcome stage, respectively.
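As a rough illustration of what a Vuong comparison involves, the sketch below computes the basic (uncorrected) Vuong statistic from the observation-level log-likelihoods of two competing models. It is not the code behind `vuong_opiop` or `vuong_opiopc`: the function name `vuong_statistic` and its arguments are hypothetical, the standard deviation uses the population (divide-by-n) form, and no adjustment for differing numbers of parameters is applied.

```python
import math
import statistics


def vuong_statistic(loglik_a, loglik_b):
    """Basic Vuong statistic for model A versus model B.

    Both arguments are sequences of per-observation log-likelihoods
    evaluated on the same observations in the same order. Large
    positive values favor model A, large negative values favor
    model B; the statistic is compared against a standard normal.
    """
    if len(loglik_a) != len(loglik_b):
        raise ValueError("both models must cover the same observations")
    # Pointwise log-likelihood ratios.
    m = [a - b for a, b in zip(loglik_a, loglik_b)]
    sd = statistics.pstdev(m)  # population standard deviation
    if sd == 0.0:
        raise ValueError("log-likelihood ratios have zero variance")
    return math.sqrt(len(m)) * statistics.mean(m) / sd


if __name__ == "__main__":
    # Made-up log-likelihood contributions, for illustration only.
    inflated = [-1.0, -1.2, -0.8, -1.1]
    standard = [-1.5, -1.3, -1.4, -1.2]
    print(vuong_statistic(inflated, standard))
```

Swapping the two arguments flips the sign of the statistic, so the order in which the models are passed determines which one a positive value favors.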
Details about the functionality summarized above are available at the package's documentation website, which is open-source and hosted by ReadTheDocs. The features of the functions in IDCeMPy are illustrated as follows: (i) the ZiOP(C) models are presented using the ordered self-reported tobacco consumption dependent variable from the 2018 National Youth Tobacco Dataset; (ii) the MiOP(C) models are illustrated using the ordered EU support outcome variable from Elgün & Tillman (2007); and (iii) the GiMNL models are evaluated using the unordered polytomous presidential vote choice dependent variable from Campbell & Monson (2008).

Availability and Installation
IDCeMPy is open-source software made available under the GNU General Public License. It can be installed from PyPI or from its GitHub repository.