geocmeans: An R package for spatial fuzzy c-means

Unsupervised classification methods like k-means or the Hierarchical Cluster Analysis (HCA) are widely used in geography even though they are not well suited for spatial data [Romary et al. (2015);] because they do not consider space. Yet, recent development has been proposed to include the geographical dimension into clustering. As an example, ClustGeo (Chavent et al., 2018) is a spatial extension of the HAC, available in the R package with the same name. We present here the R package geocmeans, proposing several spatial extensions of the Fuzzy C-Means (FCM) algorithm to complete this growing toolbox with a fuzzy approach. The package provides also several helper functions to assess and compare quality of classifications, select appropriate hyperparameters, and interpret the final groups.


Statement of need
Traditional clustering algorithms do not account for the geographical dimension of data causing two main concerns: • First, an important part of the information related to observations' locations and geographical organization is disregarded.Often, in social, environmental and economical sciences, the geographical dimension is highly structuring and displays positive spatial autocorrelation.• Second, in many applications (like the development of regional policy), it is desirable that close observations are more likely to belong to the same group.In this regard, it is common to observe with traditional unsupervised classification algorithms "holes": observations attributed to a certain group and surrounded by observations attributed to another.Often, the difference in the attributes between those observations does not justify such spatial inconsistency.
A first approach for spatial clustering explored from the 1990s was to impose to the classification a spatial constraint based on contiguity (also called aggregation based approach).The most probably known methods are the AZP (Openshaw, 1977), SKATER (Assunção et al., 2006) and AMOEBA (Aldstadt & Getis, 2006).However, this approach can only yield spatially continuous groups and can be too strict to obtain meaningful clusters.That is why a new approach has been proposed: including the spatial dimension in the clustering process.In this approach, space should not act as a constraint in the classification, but as supplementary data.
The recent method ClustGeo (Chavent et al., 2018) has received some attention from geographers.However, it yields a hard clustering and may hide situations were observations are located at the border of two groups.There is thus a need for available unsupervised spatial fuzzy clustering methods.
Such methods have been discussed and applied in the brain imagery segmentation field (Cai et al., 2007;Zhao et al., 2013) leading to the development of the Spatial Fuzzy C-means (SFCM) algorithm.The package geocmeans is proposed to make this tool available to researchers and professionals working with spatial datasets.A first application to construct a socio-residential and environmental taxonomy in Lyon has outlined the potential of the method (Gelb & Apparicio, 2021).

Core functionality
The geocmeans package has been built to be self-sufficient and minimize the coding need from users.Thus, several helper functions are available along the main clustering algorithms to asses classification quality, select hyper parameters, test cluster robustness, and interpret results.It is also possible to use the results from external clustering algorithm to build FCMres objects and work with most of the functions of geocmeans.

Main algorithm
geocmeans provides four fuzzy unsupervised classification algorithms: • CMeans, the original c-means algorithm, requiring two hyper parameters, m (fuzzyness degree) and k (number of groups).
• GFCMeans, the so-called generalized c-means algorithm.It is known to accelerate convergence and yield less fuzzy result by adjusting the membership matrix at each iteration.It requires an extra  parameter controlling the strength of the modification.The modification only affects the formula updating the membership matrix. with: the probability for observation k to belong to cluster i   the observation k in the dataset x   the cluster i m the fuzzyness parameter • SFCMeans, the SFCM algorithm, requiring two more parameters W and .W is a spatial weight matrix used to calculate a spatially lagged version of the dataset x.  is used to control the weight of the spatially lagged dataset.If  = 0 then SFCM becomes a simple FCM.If  = 1 the same weight is given to the original and lagged dataset.If  = 2 then the spatially lagged dataset has a weight doubled in comparison with the original dataset, and so on… The integration of the spatially lagged dataset modifies the formula updating the membership matrix and the formula updating the centers of clusters.
x the spatially lagged version of x As the formula suggests, the SFCM can be seen as a spatially smoothed version of the FCM and  controls the degree of spatial smoothness.This smoothing can be interpreted as an attempt to reduce spatial overfitting of the FCM.
• SGFCMeans, the SGFCM algorithm, combining SFCM and SGFCM and thus requiring the definition of three extra parameters W,  and .Only the formula to calculate the membership matrix is different from the SFCM.

Robust versions
FCM, GFCM, SFCM and SGFCM can be modified to use a "robust" version of the FCM (Tsai & Lin, 2011) The "robust" versions of the algorithms normalize the distances between the data points and the centres of the clusters.This modification improves the performance of the algorithms for clusters with non-hyperspherical shapes and uneven densities.
For the FCM algorithm, the new formula to calculate the membership matrix is the following: For the SFCM, the same normalization is applied on the distances calculated from the lagged dataset. With:

Noise cluster
For each algorithm, it is possible to add a noise cluster.This extra cluster does not have a centre and has the same distance () to each data point (Dave, 1991).It is used to catch observations that are too far from the clusters and could be identified as outliers. is also a hyperparameter to be selected by the user.

Selecting parameters
As stated above, up to five hyper parameters have to be selected by the user (six if a noise cluster is added).Finding the best combination is facilitated by the function selectParameters calculating the classifications for all the possible combinations of parameters in specified ranges and returning several metrics of classification quality.The documentation also provides an example of how to select the hyperparameters with a multi-objective genetic algorithm.

Interpreting the results
To interpret the results, several functions are provided: • summarizeClusters: returning summary statistics for each cluster for in-depth analysis.
• spiderPlots: plotting a spider chart to quickly differentiate the clusters.
• violinPlots: plotting one violin plot split by cluster for each variable in the dataset.
• barPlots: plotting a bar plot to compare means or medians of each cluster.
• mapClusters: mapping the membership matrix and the most likely cluster for the observations.• calcqualityIndexes: returning several quality indexes for a classification.
• spatialDiag: performing a complete spatial diagnostic to determine if the inclusion of space in the classification is justified.
The package includes a ShinyApp (sp_clust_explorer) which can be used to investigate the results interactively (mapping the clusters, bivariate charts, uncertain observations, etc.).
Also, several methods using the R classical S3 object system are available : • summary: same as summarizeClusters.
• print: a generic function to display the main attributes of a solution.
• predict: a function to predict the membership of new data points.

Cluster stability
Considering the random initial state of each algorithm, its is important to asses the robustness of the solution.The function boot_group_validation uses a boostrap approach to evaluate if the clusters of a solution are stable or tend to dissolve.During a selected number of iterations (at least 1000), a sample of size n (with replacement) is drawn from the original dataset.
For each sample, the same classification algorithm is applied and the results are compared with the reference results.For each original group, the most similar group is identified by calculating the Jaccard similarity index between the columns of the two membership matrices.This index is comprised between 0 (exact difference) and 1 (perfect similarity) and a value is calculated for each group at each iteration.

Raster data
Considering the important use of unsupervised classification methods on aerial or satellite imagery, geocmeans can also handle spaRast objects from the package terra and provides functions to build spatial smoothing windows for SFCM and SFGCM.

Example
The data used for the socio-residential and environmental taxonomy in Lyon are included in the package.The following example uses this data to demonstrate the basic functionality of the package.More details are given in the vignettes of the package.