RandomForestsGLS: An R package for Random Forests for dependent data

Summary With the modern advances in geographical information systems, remote sensing technologies, and low-cost sensors, we are increasingly encountering datasets where we need to account for spatial or serial dependence. Dependent observations (y1, y2, …, yn) with covariates (x1, ..., xn) can be modeled non-parametrically as yi = m(xi) + ϵi, where m(xi) is mean component and ∈i accounts for the dependency in data. We assume that dependence is captured through a covariance function of the correlated stochastic process ∈i (second order dependence). The correlation is typically a function of “spatial distance” or “time-lag” between two observations. Unlike linear regression, non-linear Machine Learning (ML) methods for estimating the regression function m can capture complex interactions among the variables. However, they often fail to account for the dependence structure, resulting in sub-optimal estimation. On the other hand, specialized software for spatial/temporal data properly models data correlation but lacks flexibility in modeling the mean function m by only focusing on linear models. RandomForestsGLS bridges the gap through a novel rendition of Random Forests (RF) – namely, RF-GLS – by explicitly modeling the spatial/serial data correlation in the RF fitting procedure to substantially improve the estimation of the mean function. Additionally, RandomForestsGLS leverages kriging to perform predictions at new locations for geo-spatial data.

Unlike linear regression, non-linear Machine Learning (ML) methods for estimating the regression function m can capture complex interactions among the variables.However, they often fail to account for the dependence structure, resulting in sub-optimal estimation.On the other hand, specialized software for spatial/temporal data properly models data correlation but lacks flexibility in modeling the mean function m by only focusing on linear models.RandomForestsGLS bridges the gap through a novel rendition of Random Forests (RF) -namely, RF-GLS -by explicitly modeling the spatial/serial data correlation in the RF fitting procedure to substantially improve the estimation of the mean function.Additionally, RandomForestsGLS leverages kriging to perform predictions at new locations for geo-spatial data.

Statement of need
RandomForestsGLS is a statistical machine learning oriented R package for fitting RF-GLS on dependent data.The RF-GLS algorithm described in (Saha et al., 2021) involves computationally intensive linear algebra operations in nested loops which are especially slow in an interpreted language like R. RandomForestsGLS efficiently implements RF-GLS algorithm in a low-level language with a user-friendly interface in R, a popular computational software (free under GNU General Public License) in the statistics community.RandomForestsGLS focuses on fast, parallelizable implementations of RF-GLS for spatial and time series data, which includes popular choices for covariance functions for both spatial (Matérn GP) and time series (autoregressive) data.The package is primarily designed to be used by researchers associated with the fields of statistical machine learning, spatial statistics, time series analysis, and their scientific applications.A significant part of the code has already been used in Saha et al. (2021).With the combination of speed and ease-of-use that RandomForestsGLS brings to the table regarding non-linear regression analysis for dependent data, we hope to see this package being used in a plethora of future scientific and methodological explorations.

State of the field
Several R packages implement classical RF.Most notable of them being randomForest, which implements Breiman's Random Forests (Breiman, 2001) for Classification and Regression using the Fortran.Some of the other packages are xgboost, randomForestSRC, ranger, Rborist.For a detailed overview of it, we refer the reader to CRAN Task View: Machine Learning & Statistical Learning.To the best of our knowledge, none of these packages explicitly account for spatial and/or temporal correlation.
Classical RF has been used in geo-spatial and temporal applications (see Saha et al. (2021) 2021) for details).Both approaches try to account for the dependence structure by incorporating additional spatial covariates, which adversely affect the prediction performance in the presence of a dominant covariate effect (Saha et al. (2021)).Additionally, unlike RF-GLS, they cannot estimate covariate effects separately from the spatial effect, which can be of independent interest.The RF for temporal data, proposed in Basak et al. (2019), suffers from similar shortcomings.

The RandomForestsGLS package
We provide a brief overview of the functionality of the package.The package vignette demonstrates with example how to use the package for non-linear regression analysis of dependent data.Specific functions are discussed in detail in the documentation of the package.

RF to RF-GLS: Accounting for correlation structure
In classical RF, which is an average of many regression trees, each node in a regression tree is split by optimizing the CART split criterion in Breiman et al. (1984).It can be rewritten in the following way: where Z (0) and Z are the binary membership matrices for the leaf nodes of the tree before and after the potential node split.(d, c) denotes a potential cut (location of the split), with d and c being the cut direction (choice of the covariate) and cutoff point (value of the covariate) respectively, β(Z) are the leaf node representatives given by OLS estimates corresponding to a design matrix Z and can be written as: We observe that the split criterion is the difference of OLS loss functions before and after the cut with the respective design matrices of membership of the leaf nodes.We can incorporate the correlation structure of the data in the split criterion by replacing the OLS loss with GLS loss as is traditionally done in linear models.The modified split criterion can be rewritten as: where Q is the inverse of the working covariance matrix that models the spatial/serial dependence and βGLS (Z) are the leaf node representatives given by the GLS estimates corresponding to a design matrix Z and can be written as follows:

Model
We consider spatial point-referenced data with the following mixed model: where y i , x i respectively denotes the observed response and the covariate corresponding to the i th observed location s i .m(x i ) denotes the covariate effect; spatial random effect, w(s) accounts for spatial dependence beyond covariates modeled using a GP, and accounts for independent and identically distributed random Gaussian noise.

Fitting & Prediction
Spatial random effects are modeled using a Gaussian process (GP) as is the practice.We use the computationally convenient Nearest Neighbor GP (NNGP) (Datta et al., 2016).Model parameters, if unknown, are estimated from the data (Saha et al. (2021)).Alongside predicting covariate effects for a new covariate value, we also offer spatial prediction at new locations with non-linear kriging by combining the non-linear mean estimate and spatial kriging estimate from the BRISC (Saha & Datta, 2018) package.

Autoregressive (AR) Time Series Data
Model RF-GLS can also be used for function estimation in a time series setting under autoregressive (AR) errors.We consider time series data with errors from an AR(q) (autoregressive process of lag q) process as follows: ρ i e t−i + η t where y i , x i denote the response and the covariate corresponding to the t th time point, respectively, e t is an AR(q) process, η t denotes i.i.d.white noise and (ρ 1 , • • • , ρ q ) are the model parameters that capture the dependence of e t on (e t−1 , • • • , e t−q ).

Fitting & Prediction
RF-GLS exploits the sparsity of the closed form precision matrix of the AR process for model fitting and prediction of the mean function m(.).If the AR model parameters (coefficients and order of autoregressive process) are unknown, the code automatically estimates them from AR models with specified lags.The prediction of covariate effect for time series data is like that of spatial data.

Discussion
This package provides an efficient, parallel implementation of the RF-GLS method proposed in Saha et al. (2021) which accounts for correlated data with modified node splitting criteria and node representative update rule.The package accounts for spatial correlation via Matérn GP or serial autocorrelation.It provides parameter estimation for both covariance structures.Efficient implementation through C/C++ takes advantage of the NNGP approximation and scalable node splitting update rules which reduces the execution time.More details on package features, parameter estimation, and parallelization can be found in the package README.
Since the computational complexity of evaluating potential splits is cubic in the number of leaf nodes for RF-GLS, but constant for standard RF, improving the computational complexity of the algorithm is of independent research interest.We also plan to implement and validate additional forms of time series dependency.
for references) without making methodological adjustments to account for spatial dependencies.(CRAN Task View: Analysis of Spatial Data, CRAN Task View: Time Series Analysis).Two recent works that attempt to explicitly use spatial information in RF for prediction purposes, are Hengl et al. (2018) ( GeoMLA) and Georganos et al. (2019) ( SpatialML) (see Saha et al. (