geostan : An R package for Bayesian spatial analysis

Analyses of data collected across areal units, such as census tracts and states, are now ubiquitous in the social and health sciences. Data sources include surveys (especially large government-backed surveys like the US Census Bureau’s American Community Survey (ACS)), vital statistics systems, and disease registries (particularly cancer registries). These data sources can provide crucial information about population health and socio-economic outcomes, but many standard (non-spatial) statistical methods and workflows are either not applicable to spatial data or they require adjustment (Cressie, 2015; Haining & Li, 2020).


Statement of need
The distinguishing characteristic of spatial data is that maps of the data typically contain moderate to strong spatial patterns, or spatial autocorrelation, which typically reduces effective sample size (ESS) and renders many standard statistical methods inappropriate (Clifford et al., 1989;"Student" [W.S. Gausset], 1914). In addition, spatial patterns are often of direct interest-for example, disease mapping studies are concerned primarily with understanding how disease or mortality risk vary over space.
A major challenge for spatial analysis is data quality, particularly for researchers using surveybased covariates. A single spatial analysis may use dozens, or even thousands, of error-laden survey estimates. Sampling error in ACS estimates is often substantial in magnitude and socially patterned Folch et al., 2016), which can have real consequences on communities and service providers (Bazuin & Fraser, 2013). Spatial ME models are required to avoid ME biases and unwarranted levels of confidence in results.
geostan fills two gaps in this software landscape. First, geostan offers spatial ME models that are appropriate for survey-based covariates. Second, geostan provides spatial model diagnostic functions that make it easy for users to evaluate model results even if they are unfamiliar with MCMC analysis. Functionality geostan provides tools for spatial data visualization, construction of spatial weights matrices, spatial ME models, models for censored count data, and multiple types of spatial statistical models for continuous and discrete data types. The shape2mat function creates spatial weights matrices by first calling the spdep package (R. S. Bivand et al., 2013) to identify the adjacency structure of the spatial data, and results are returned to the user in sparse matrix format using the Matrix package (Bates et al., 2022).
geostan uses MCMC for inference, which allows users to conduct formal inference on generated quantities of interest. The models are built using the Stan modeling language, a state-of-the-art platform for MCMC sampling (Gabry et al., 2020;Stan Development Team, 2022a, 2022b, but users only need to be familiar with the standard R formula interface. Because geostan returns stanfit objects from rstan, it is compatible with the rstan ecosystem of packages including shinystan for visual summaries of model parameters and MCMC diagnostics (Gabry, 2018), tidybayes for working with MCMC samples (Kay, 2022), and bridgesampling for model comparison using Bayes factors (Gronau et al., 2020).

Exploratory spatial data analysis (ESDA)
The package provides convenience functions for visualizing spatial patterns and conducting ESDA, including • Moran scatter plot for visualizing spatial autocorrelation (Chun & Griffith, 2013) • Moran coefficient and Geary Ratio for measuring global spatial autocorrelation (Chun & Griffith, 2013) • Local Moran's I and local Geary's C for measuring and visualizing local spatial autocorrelation (Anselin, 1995) • The Approximate Profile Likelihood (APLE) estimator for measuring spatial autocorrelation (Li et al., 2007) • Effective sample size (ESS) calculation (D. A. Griffith, 2005) These tools are provided for exploratory analysis, but not for detection of clusters. For this and other reasons, p-values are not provided. Graphics are created with ggplot2 (Wickham, 2016).

Spatial models
All of the models allow for a set of exchangeable 'random effects' to be added, and spatially lagged covariates (SLX) can also be added to any of the models. While proper CAR models have been avoided in the past due to their computational burden, the CAR model is the most efficient spatial model in geostan. It is fast enough to work interactively on a laptop with more than 3000 observations, such as U.S. county data (Donegan, 2021).
A set of functions for working with model results conveniently extracts fitted values, marginal effects, residuals, spatial trends, and posterior (or prior) predictive distributions. Users are encouraged to always undertake a thoughtful spatial analysis of model residuals and other quantities to critique and improve their models through successive rounds of ESDA (cf. Gabry et al., 2019).
Spatial ME models ME models can be added to any geostan model. These are models for covariates measured with error, particularly small-area survey estimates with standard errors. The ME models treat the true covariate values as unknown parameters or latent variables, which are assigned a spatial CAR prior model. Users provide the scale of observational uncertainty or ME (e.g., survey standard errors) as data cf. Bernardinelli et al., 1997;Kang et al., 2009;Logan et al., 2019;Xia & Carlin, 1998). All uncertain inferences from the ME models are automatically propagated throughout the regression or a disease mapping model, and graphical diagnostics are provided for evaluating results of spatial ME models.