samplics: a Python Package for selecting, weighting and analyzing data from complex sampling designs

Survey sampling techniques are used in various fields to obtain information about a large population by studying a fraction of its elements. As a result, it helps produce a significant portion of the official statistics by national governments and international organizations. For example, the Demographic and Health Survey (DHS) and the Multiple Cluster Indicator Survey (MICS) have been collecting demographic and health indicators for more than 35 years and 25 years respectively in over 100 countries. DHS and MICS are two of the primary sources of data for tracking the progress towards achieving the Sustainable Development Goals (SDGs). Similarly, numerous political and socio-economic branches of society rely on survey sampling to estimate the characteristics of populations of interest.


Statement of Need
samplics aims at providing a comprehensive statistical package for analyzing survey sample data. The primary target audiences are survey statisticians and other data analysts working with sample data obtained from complex survey designs. Data specialists can use this package to produce, analyze, and use official statistics. It also can help teach statistical concepts given the wide use of Python in Education and the simplicity of the samplics APIs.
When designing a survey, samplics can calculate sample sizes by stratum based on expected proportions and level of precision for the indicator of interest as well as measures of the complexity of the design such as survey design effects. After sample sizes are determined, samp lics can calculate selection probabilities according to the sample selection strategy. To ensure the representativeness of the random sample, samplics will compute design weights and adjust them for non-response, post-stratification, and calibration. samplics provides Taylorbased and replication-based techniques for calculating population parameter estimates and associated measures of uncertainty. Finally, samplics has a small area estimation subpackage that can predict small area parameters. Note that samplics can be used independently for any of the steps described in the paragraph.
The sections below provide more details on the survey sampling techniques implemented in samplics.

Survey Sampling Techniques
In large-scale surveys, often complex random mechanisms are used to select samples. Estimations obtained from such samples must reflect the complexity of the random mechanism to ensure correct approximations of the population parameters by sample estimates (Cochran, 1977), (Kish, 1965), and (Lohr, 2010). For Python users, samplics provides a comprehensive ecosystem of survey sampling techniques.

Sample Selection
The sample selection mechanism, a fundamental aspect of survey sampling, guides the statistical techniques employed to ensure the representativeness of the sample. In samplics, the focus is on random sampling techniques where units in the target population have a known probability of inclusion in the sample. let us assume that the target population has N units, and let us note π i , the probability of unit i to be included in the sample. That is P (I i = 1) = π i , where I i indicates whether unit i was included or not in the sample. The sample selection techniques implemented in samplics can be viewed as the combination of three key concepts: selection probability, stratification, and clustering. SRS results in an equal probability of selection for all sampling units, P (I i = 1) = π. Stratification is a technique that consists of dividing the target population into m partitions, and sample selection is performed independently in each partition called stratum. Stratification is commonly used to divide the population, hence the sample, into homogeneous groups, e.g., income class, gender, ethnic group, etc. But it can also be used to control sample sizes by stratum; for example, governments often use stratification to ensure proper coverage of geographical administrative entities in the sample. Clustering is useful when a sample frame is unavailable for the units of interest or the operational cost of directly selecting the units and collecting the data is too high. In cluster sampling design, units of interest are grouped into clusters, and a sample of clusters is selected first (one-stage cluster sampling). Clustering can be done at multiple levels resulting in two-stage (or more) cluster sampling designs. Probability proportional to size (PPS) methods, e.g., Systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method, are commonly used to select the clusters (Brewer & Hanif, 1983). Generally, cluster sampling leads to unequal probabilities of inclusion of sample units.

Sample Weighting
Sample weighting is the main mechanism used in surveys to formalize the representativeness of the sample. In complex surveys, sample weighting is composed of two main steps. First, the design (or base) weights are calculated as the inverse of the selection probabilities. Let us us assume that π i is the final selection probability for unit i in the sample. Hence, d i = 1 πi , where d i is the design weight associated with unit i and can be interpreted as the average number of units in the target population represented by i including itself. Second, the design weights are adjusted to compensate for distortions due to shortcomings of the sample design implementation. Often, the initial weight adjustment corrects for nonresponse. This adjustment consists of defining response classes, then inflating the sample weights within response classes to compensate for the loss of sampled units due to nonresponse. In complex surveys, it is common to perform multiple sample weight adjustments. Hence, within a response class, the adjusted sample weights can be obtained as follows: where a k is the adjustment factor for step k. When reliable auxiliary information is available at the population level, poststratification and calibration can be used to adjust the sample weights. samplics also computes replicate weights, i.e., balanced repeated replication (BRR), bootstrap, and jackknife, often used to estimate complex parameters such as quantiles. (Valliant & Dever, 2018) provides a step-by-step guide for calculating sample weights for complex sampling designs.

Sample Estimation
As mentioned above, estimation of population parameters e.g., total, mean, median, coefficient of correlation, regression coefficients, etc., is one of the main objectives of surveys sampling. The sample weight is the primary mechanism for generalizing the sample estimate to approximate the equivalent population parameter. let us us us us consider the population parameter, total, defined as where H is the number of strata, N h is the number of primary sampling units (PSUs) in the population from stratum h and M hj is the number of units from PSU i in stratum h. It follows that the sample estimate of the total is defined asŶ where n h is the number of PSUs in the sample from stratum h and m hj is the number of units in the sample from PSU i in stratum h. I hij denotes the inclusion status of unit hij to the sample i.e., I hij = 1 if unit hij is included in the sample otherwise I hij = 0. The uncertainty estimation of the sample estimate must reflect the sampling mechanism and the weight adjustments. samplics provides two main frameworks for computing uncertainties, linearization (Taylor series) and replication.
Using the Taylor series method, the variance of the total is estimated aŝ where y hi. = ∑ m hi j=1 w hij y hij ,ȳ h.. = ∑ n h i=1 y hi. /n h , and f h is the sampling rate for the first stage from stratum h. The formula can be extended to the two-stage sampling design where second stage clusters or secondary sampling units (SSUs) are randomly selected from the PSUs prior to the selection of final sample units within selected SSUs. Under the two-stage sampling design, the Taylor series variance estimate of the total iŝ j=1 y hij. /m hi , and f hi is the sampling rate for the second stage of sampling from PSU i in stratum h. The variance estimation of the total can be extended to other population parameters that are functions of the sample weight. For example, the variance estimates of the mean and ratio are obtained by replacing y hijk by (y hijk −Ŷ )/Ŵ and ∑ k=1 x hijk andR =Ŷ /X. Furthermore, the variance estimators in this section are extensible to domain analysis.
Suppose that θ is the population parameter of interest. Under the replication framework, multiple replicates, say R, of the sample are drawn following a given selection scheme (Efron & Tibshirani, 1994) and (Wolter, 2007). For each replicate, a set of replicate weights is constructed by multiplying the sample weights by an adjustement factor a hi . The resulting weights, called the replicate weights, are then used to obtain the R replicate estimates of the population parameter i.e.θ (r) , r = 1, ..., R. The estimate of the variance ofθ is then given bŷ r) . Both c r and a hi are specific to the replication method. For Bootstrap, we have c r = 1/R and a hi = n h n h −1 m * hi , where m * hi is the number of times PSU hi was resampled. The replication factor c r is the same across the strata, however the weight adjustment factors a hi are stratum specific.
For balanced repeated replication (BRR) with Fay, we have where Hd is the Hadarmard matrix. f = 0 corresponds to the default BRR method without the Fay adjustment. A Hadamard matrix is a square matrix whose entries are either +1 or −1 and whose rows are mutually orthogonal. In the case of BRR-Fay, both the replication factor c r and the weight adjustment factor a hi are constant across the strata.
For Jackknife (delete-one), we have This formula is easily generalizable to the non stratified design (H = 1) by replacing n h ′ by n and dropping the case h ′ ̸ = h. The replication factor c r is stratum specific in the case of Jackknife, which allows a finite-population correction by stratum.

Small Area Estimation (SAE)
When the sample size is insufficient to produce reliable/stable domain level estimates, SAE techniques model the output variable of interest and produce domain level estimates. These domains are referred to as small areas. For the most part, the SAE models are applications of mixed models, see (McCulloch et al., 2008) and (Rao & Molina, 2015) for more details on mixed models. Mixed models allow accounting for the between-area variations by using random area-specific effects and the auxiliary variables contribution through the fixed effects. Small Area Estimation models are generally classified into two classes: the Area-level and the Unit-level models (Rao & Molina, 2015).

Area-level Model
As mentioned above, the Areal-level approach models the variables of interest using known auxiliary information at some aggregated level(s treated as known for the purpose of deriving the estimates. Under the basic Area-level model, the best predictor (best in the sense of minimizing the mean squared error)) of θ iŝ andβ is the best linear unbiased estimator of β. The empirical best (EB), or empirical Bayes, predictorθ EB d is obtained by replacing the unknown parameter in the expression ofθ B d by their estimators. The parameters of the model, β and σ 2 u are estimated using method of moment (MOM), maximum likelihood (ML), restricted maximum likelihood (REML), or other suitable techniques. The EB estimator is a weighted average of the survey (direct) estimatorθ d and the regression predictor x T dβ where the weight isB d = ψ d /(σ 2 u +ψ d ).

Unit-level model
The Unit-level models fit the data at the atomic individual unit level. The basic Unit-level model can be formally defined as follows: ..., N d , d = 1, ..., m, where u d iid ∼ N (0, σ 2 u ) and e dj iid ∼ N (0, σ 2 e ) are independent random normal variables, x dj is the vector of auxiliary variables, d designates the small area and j designates the the unit within the small-area d. The best linear unbiased predictor (BLUP) estimator of the small where γ d = σ 2 u σ 2 e +n d σ 2 u , the estimatorβ is the best linear unbiased estimator of β, and n d is the sample size for small area d. The empirical best linear predictor,θ EB d , is obtained by replacing the model parameters by their estimators in the expression ofθ B d . (Elbers et al., 2003) extends the basic Unit-level model by relaxing the normal distribution of the errors with an empirical semi-parametric model. This model has been used by the World Bank to estimate small area poverty indices. Furthermore, (Molina & Rao, 2010) provide a parametric approach for estimating complex small area parameters such as poverty indices.