PyEI: A Python package for ecological inference

An important question in some voting rights and redistricting litigation in the U.S. is whether and to what degree voting is racially polarized. In the setting of voting rights cases, there is a family of methods called “ecological inference” (see especially King, 1997) that uses observed data, pairing voting outcomes with demographic information for each precinct in a given polity, to infer voting patterns for each demographic group.

More generally, we can think of ecological inference as seeking to use knowledge about the margins of a set of tables (Table 1) to infer associations between the row and column variables, by making (typically probabilistic) assumptions. In the context of assessing racially polarized voting, a table like the one in Table 1 will correspond to a precinct, where each column corresponds to a candidate or voting outcome and each row to a racial group. Ecological inference methods then use the vote counts and demographic data for each precinct to make inferences about the overall voting preferences by demographic group, thus addressing questions like: "What percentage of East Asian voters voted for Hardy?" This example is an instance of what is referred to in the literature as "R by C" ecological inference, here with R = 2 groups and C = 3 voting outcomes. PyEI was created to support performing ecological inference with voting data; however, ecological inference methods are also applicable in other fields, such as epidemiology (Elliot et al., 2000) and sociology (Goodman, 1953).
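To make concrete why assumptions are needed, the following minimal sketch builds two hypothetical 2 by 3 tables of counts for a single precinct. The counts are invented for illustration and are not drawn from Table 1 or from any real election. Both tables share the same row and column totals, yet they imply different levels of support within the first group.

```python
import numpy as np

# Hypothetical precinct tables: rows are R = 2 groups, columns are C = 3 voting outcomes.
# In practice the interior cells are unobserved; only the margins are known.
table_a = np.array([[30, 10, 10],
                    [20, 40, 40]])
table_b = np.array([[10, 20, 20],
                    [40, 30, 30]])


def margins(table):
    """Return (row totals, column totals) of a table of counts."""
    return table.sum(axis=1), table.sum(axis=0)


rows_a, cols_a = margins(table_a)
rows_b, cols_b = margins(table_b)

# Both tables have identical margins ...
same_margins = np.array_equal(rows_a, rows_b) and np.array_equal(cols_a, cols_b)

# ... yet imply different support for the first outcome within the first group.
support_a = table_a[0, 0] / rows_a[0]
support_b = table_b[0, 0] / rows_b[0]
print(same_margins, support_a, support_b)
```

Because the margins alone cannot distinguish between such tables, ecological inference methods combine information across many precincts with modeling assumptions.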

Statement of need
The results of ecological inference for inferring racially polarized voting are routinely used in U.S. voting rights cases (King, 1997); therefore, easy-to-use, high-quality tools for performing ecological inference are of practical interest. There is a need for an ecological inference library that brings together a variety of ecological inference methods in one place to facilitate crucial tasks such as: quantifying the uncertainty associated with ecological inference results under a given model; making comparisons between methods; and bringing relevant diagnostic tools to bear on ecological inference methods. To address this need, we introduce PyEI, a Python package for ecological inference.
PyEI is meant to be useful to two main groups of researchers. First, it serves application-oriented researchers and practitioners who seek to run ecological inference on domain data (e.g., voting data), report the results, and understand the uncertainty related to those results. Second, it facilitates exploration and benchmarking for researchers who are seeking to understand properties of existing ecological inference methods in different settings and/or develop new statistical methods for ecological inference.
PyEI brings together the following ecological inference methods in a common framework alongside plotting, reporting, and diagnostic tools:
• Goodman's ecological regression (Goodman, 1953) and a Bayesian linear regression variant
• A truncated-normal based approach (King, 1997)
• Binomial-Beta hierarchical models (King et al., 1999)
• Dirichlet-Multinomial hierarchical models (Rosen et al., 2001)
• A Bayesian hierarchical method for 2 × 2 EI following the approach of Wakefield (2004)
In several of these cases, PyEI includes modifications to the models as originally proposed in the cited literature, such as reparametrizations or other changes to upper levels of the hierarchical models in order to ease sampling difficulties.
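As an illustration of the simplest method listed above, the sketch below applies the idea behind Goodman's ecological regression to synthetic 2 × 2 data using only NumPy. It does not use PyEI's API. The variable names, the assumed group preferences (0.7 and 0.3), and the noise level are all hypothetical choices made for this example. The sketch regresses each precinct's vote share on its group fraction and reads the two group-level support rates off the fitted line.

```python
import numpy as np

rng = np.random.default_rng(0)
n_precincts = 50

# Fraction of each precinct's voters belonging to the group of interest (synthetic).
group_fraction = rng.uniform(0.05, 0.95, size=n_precincts)

# Assumed (hypothetical) support rates, constant across precincts.
true_support_group = 0.7
true_support_rest = 0.3

# Observed vote share per precinct: a mixture of the two rates plus small noise.
vote_share = (
    true_support_group * group_fraction
    + true_support_rest * (1.0 - group_fraction)
    + rng.normal(0.0, 0.02, size=n_precincts)
)

# Ordinary least squares fit of vote_share = intercept + slope * group_fraction.
design = np.column_stack([np.ones(n_precincts), group_fraction])
(intercept, slope), *_ = np.linalg.lstsq(design, vote_share, rcond=None)

# The fitted line evaluated at group_fraction = 0 and 1 gives the two estimates.
est_support_rest = intercept
est_support_group = intercept + slope
print(est_support_group, est_support_rest)
```

This point estimate relies on the assumption that group preferences are constant across precincts and carries no uncertainty quantification, which is part of the motivation for the hierarchical Bayesian models in the list.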
PyEI is intended to be easily extensible, so that additional methods from the literature can continue to be incorporated (for example, work is underway to add the method of Greiner & Quinn (2009), currently implemented in the R package RxCEcolInf (Greiner et al., 2019)). Newly developed statistical methods for ecological inference can be included and conveniently compared with existing methods.
Several R libraries implementing different ecological inference methods exist, such as eiPack (Lau et al., 2020), RxCEcolInf (Greiner et al., 2019), ei (King & Roberts, 2016), and eiCompare (Collingwood et al., 2020). In addition to presenting a Python-based option that researchers who primarily use Python may appreciate, PyEI incorporates the following key features and characteristics.
First, the Bayesian hierarchical methods implemented in PyEI rest on modern probabilistic programming tooling (Salvatier et al., 2016) and gradient-based MCMC methods such as the No-U-Turn Sampler (NUTS) (Betancourt, 2018; Hoffman & Gelman, 2014). Using NUTS where possible should allow for faster convergence than existing implementations that rest primarily on Metropolis-Hastings and Gibbs sampling steps. Consider effective sample size, which is a measure of how the variance of a Monte Carlo estimate of a posterior expectation computed from dependent samples compares to the variance of the corresponding estimate computed from independent samples from the posterior distribution (or, very roughly, how "effective" the samples are for estimating a posterior expectation, compared to independent samples) (Gelman et al., 2013). Under certain assumptions on the target posterior distribution, in Metropolis-Hastings the number of evaluations of the log-posterior required for a given effective sample size scales linearly with the dimensionality of the parameter space, while in Hamiltonian Monte Carlo approaches such as NUTS, the number of required evaluations of the gradient of the log-posterior scales only as the fourth root of the dimension (Neal, 2011). Reasonable scaling with the dimensionality of the parameter space is important in ecological inference, as that dimensionality is large when there are many precincts.
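The notion of effective sample size can be illustrated with a rough sketch. The estimator below divides the number of draws by one plus twice the sum of the sample autocorrelations, truncating the sum at the first non-positive term. This is a simplified textbook-style estimator chosen for illustration, not the estimator used by ArviZ or PyEI, and the autoregressive chain is a hypothetical stand-in for correlated MCMC output.

```python
import numpy as np


def autocorrelation(draws):
    """Sample autocorrelation at lags 0..n-1, computed with a zero-padded FFT."""
    draws = np.asarray(draws, dtype=float)
    n = len(draws)
    centered = draws - draws.mean()
    spectrum = np.fft.rfft(centered, n=2 * n)  # padding avoids circular wrap-around
    autocov = np.fft.irfft(spectrum * np.conj(spectrum), n=2 * n)[:n] / n
    return autocov / autocov[0]


def effective_sample_size(draws):
    """Simplified ESS: n / (1 + 2 * sum of leading positive autocorrelations)."""
    rho = autocorrelation(draws)
    n = len(rho)
    tau = 1.0
    for lag in range(1, n):
        if rho[lag] <= 0.0:  # truncate at the first non-positive autocorrelation
            break
        tau += 2.0 * rho[lag]
    return n / tau


rng = np.random.default_rng(1)
n_draws = 5000

# Independent draws: ESS should be close to the number of draws.
independent = rng.normal(size=n_draws)

# Strongly autocorrelated AR(1) chain: ESS should be far smaller.
phi = 0.9
noise = rng.normal(size=n_draws)
correlated = np.empty(n_draws)
correlated[0] = noise[0]
for i in range(1, n_draws):
    correlated[i] = phi * correlated[i - 1] + noise[i]

ess_independent = effective_sample_size(independent)
ess_correlated = effective_sample_size(correlated)
print(ess_independent, ess_correlated)
```

For an AR(1) chain with coefficient phi, the ratio of effective to actual sample size is (1 - phi) / (1 + phi) in theory, roughly 0.05 when phi = 0.9, so most of the 5000 correlated draws add little new information.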
Second, integration with the existing tools PyMC3 (Salvatier et al., 2016) and ArviZ (Kumar et al., 2019) makes the results amenable to state-of-the-art diagnostics (e.g., convergence diagnostics), and some reasonable checks are performed automatically.
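As one example of what a convergence diagnostic measures, the sketch below computes a basic potential scale reduction factor (R-hat) by comparing between-chain and within-chain variance. This is the classic, unsplit form written out in NumPy for illustration; it is an assumption of this example rather than a description of the checks PyEI runs, and ArviZ provides more refined variants. The chains are synthetic.

```python
import numpy as np


def basic_rhat(chains):
    """Classic (unsplit) R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()              # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)     # B: scaled variance of chain means
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)


rng = np.random.default_rng(2)

# Four synthetic chains that all sample the same distribution: R-hat near 1.
mixed_chains = rng.normal(size=(4, 1000))

# Four synthetic chains stuck around different values: R-hat well above 1.
offsets = np.array([0.0, 1.0, 2.0, 3.0])[:, None]
stuck_chains = mixed_chains + offsets

rhat_mixed = basic_rhat(mixed_chains)
rhat_stuck = basic_rhat(stuck_chains)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 are consistent with chains that agree with one another, while larger values suggest the chains have not converged to a common distribution.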
Third, summary and plotting utilities for reporting, visualizing, and comparing results are included (e.g. Figure 1, Figure 2), with an emphasis on visualizations and reports that clarify the uncertainty of estimates under a model.
Lastly, clear documentation is provided, including a set of introductory and example notebooks.