spopt: a python package for solving spatial optimization problems in PySAL

1 Center for Geospatial Sciences, University of California Riverside 2 Department of Geography and Environmental Sustainability, University of Oklahoma 3 Federal University of Viçosa 4 Oak Ridge National Laboratory 5 The Peter R. Gould Center for Geography Education and Outreach, Penn State 6 University of Bristol 7 Urban Big Data Centre, School of Social & Political Sciences, University of Glasgow DOI: 10.21105/joss.03330


Statement of need
Spatial optimization methods and algorithms can be accessed in many ways. ArcGIS 1 and TransCAD 2 are two well-known commercial GIS software packages that provide modules designed for structuring and solving spatial optimization problems. The optimization functions they offer focus on a set classical single facility location methods (e.g., Weber, Median, Centroid, 1-center), routing and shortest path methods (e.g., shortest path on the network, least cost path over the terrain), and multi-facility location-allocation methods (e.g., coverage models, p-median problem). They are user-friendly and visually appealing, but the cost is relatively high (Murray, 2021).
Open-source software is another option to access spatial optimization. Although it may require users to have a certain level of programming experience, open-source software provides relatively novel and comprehensive methods, and more importantly, it is free and can be easily replicated. This is particularly true for regionalization and facility-location methods. Regionalization methods are limited in commercial GIS software, and may only have grouping analysis for vector data and region identification for raster data. On the contrary, there are many application-oriented open-source packages that facilitate the implementation of regionalization methods in various fields, including climate (e.g., HiClimR (Badr et al., 2015), synoptReg (Lemus-Canovas et al., 2019)), biology (e.g., Phyloregion (Daru et al., 2020), regioneR (Gel et al., 2015)), hydrology (e.g., nsRFA (Viglione, 2009)), agriculture (e.g., OpenLCA 3 ), and so on. The functions of graph regionalization with clustering and partitioning have been provided by several packages in R 4 such as Rgeoda (Li & Anselin, 2022), maxcut: Max-Cut Problem in sdpt3r (Rahman, 2019), and RBGL: R Boost Graph Library (Carey et al., 2022), and grPartition within MATLAB 5 . They are probably the most closely related projects to the regionalization functionality of spopt, however, they are written in R and MATLAB. For facility-location methods, commercial software such as ArcGIS and TransCAD implement models using heuristic approaches. However, they don't provide details about the solution found, which limits the interpretability of the results (Chen et al., 2021). On the other hand, existing open-source packages mostly aim at solving coverage problems such as pyspatialopt (Pulver, 2016), allagash (Pulver, 2019) and maxcovr (Tierney, 2019), but the available models, solvers, and overall accessibility vary significantly. Therefore, it is necessary to develop an open-source optimization package written in Python that includes various types of classic facility-location methods with a wide range of supported optimization solvers.

Current functionality
Originating from the region module in PySAL, spopt is under active development for the inclusion of newly proposed models and methods for regionalization and facility location. Regarding regionalization, six models are developed for aggregating a large set of geographic units (with small footprints) into a smaller number of regions (with large footprints). They are: 1. Max-p-regions: the clustering of a set of geographic areas into the maximum number of homogeneous and spatially contiguous regions such that the value of a spatially extensive regional attribute is above a predefined threshold (Duque et al., 2012;Wei et al., 2020). 2. Spatially-encouraged spectral clustering (spenc): an algorithm to balance spatial and feature coherence using kernel combination in spectral clustering (Wolf, n.d.). 3. Region-K-means: K-means clustering for regions with the constraint that each cluster forms a spatially connected component. 4. Automatic Zoning Procedure (AZP): the aggregation of data for a larger number of zones into a prespecified smaller number of regions based on a predefined type of objective function (Openshaw, 1977;Openshaw & Rao, 1995). 5. Skater: a constrained spatial regionalization algorithm based on spanning tree pruning.
Specifically, the number of edges is prespecified to be cut in a continuous tree to group spatial units into contiguous regions (Assunção et al., 2006). 6. WardSpatial: an agglomerative clustering (each observation starts in its own cluster, and pairs of clusters are chosen to merge at each step) using ward linkage (the goal is to minimize the variance of the clusters) with a spatial connectivity constraint (sklearn.cluster.AgglomerativeClustering (Pedregosa et al., 2011)).
Take the functionality of Max-p-regions as an example. Other methods can be applied in a similar process, including importing the needed packages, imputing and reading data, defining the parameters, solving the model, and plotting the solution. The corresponding solution of Max-p-regions running the above code is shown in Figure 1. It results in five regions, three of which have six states, and two with seven states each. Each region is a spatially connected component, as required by the Max-p-regions problem. For facility-location, four models, including two coverage models and two location-allocation models based on median and center problems, are developed using an exact approach with pulp (Mitchell et al., 2011) providing an interface to installed solvers.

Location Set Covering Problem (LSCP):
Finding the minimum number of facilities and their locations such that all demands are covered within the maximal threshold of service distance or time (Toregas et al., 1971). 2. Maximal Covering Location Problem (MCLP): Locating a prespecified number of facilities such that demand coverage within a maximal threshold of service distance or time is maximized (Church & ReVelle, 1974). 3. P-Median Problem: Locating p facilities and allocating the demand served by these facilities so that the total weighted assignment distance or time is minimized (ReVelle & Swain, 1970). 4. P-Center Problem: Locating p facilities and allocating the demand served by these facilities to minimize the maximum assignment distance or time between demands and their allocated facilities (Hakimi, 1964).
For example, Maximal Covering Location Model functionality is used to select 4 out of 16 store sites in the San Francisco area to maximize demand coverage, as shown in Figure 2.
Other facility-location methods can be applied in a similar way. Moreover, we have included functionality to place pre-determined site locations within the facility selection pool, allowing for realistic scenarios whereby 1 or more new facilities can be added to an existing set of locations. We believe this feature of intermingled new and existing facility location siting to be the first implementation in open-source optimization software.

Figure 2:
The solution of MCLP while siting 4 facilities using 5 kilometers as the maximum service distance between facilities and demand locations. See the "Real World Facility Location" tutorial (https://pysal.org/spopt/notebooks/facloc-real-world.html) for more details.

Planned Enhancements
Spopt is under active development and the developers look forward to your extensive attention and participation. In the near future, there are three major enhancements we plan to pursue for spopt: 1. The first stream will be on the enhancement of regionalization algorithms by including several novel extensions of the classical regionalization models, such as the integration of spatial data uncertainty and the shape of identified regions in the Max-p-regions problem. 2. The second direction involves adding capacity constraints and includes polygon partial coverage for facility location models.
No commercial and open-source software has provided these features to date.
3. We anticipate adding functionality for solving traditional routing and transportationoriented optimization problems. Initially, this will come in the form of integer programming formulations of the Travelling Salesperson Problem (Miller et al., 1960) and the Transportation Problem (Koopmans, 1949).