ordPens: An R package for Selection, Smoothing and Principal Components Analysis for Ordinal Variables

Summary
Ordinal data are common in applied statistics. To take the ordinal scale level into account, regularization techniques, among other approaches, are often suggested in the literature (Tutz & Gertheiss, 2014, 2016). In particular, penalization approaches for smoothing and selection are commonly proposed for Likert-type data, although they are by no means restricted to Likert scales. ordPens is a package in the R programming language (R Core Team, 2021) that provides several penalty approaches for ordinal predictors in regression models and for ordinal variables in principal component analysis (PCA). In the regression context, smoothing is obtained by introducing a penalty term and a tuning parameter controlling the amount of penalization. Adding the penalty term to the likelihood function yields the penalized likelihood, which is then maximized. Different types of penalization can be considered, depending on whether the goal is smoothing, selection of variables or clustering of categories. Smoothing only can be achieved by penalizing the sum of squared differences of adjacent coefficients for a given variable, which respects the ordering of the categories. A modified group lasso based on a difference penalty can be used for selection. Clustering/fusion of categories can be achieved by the fused lasso, which penalizes absolute differences of adjacent coefficients using the L1-norm.

Statement of Need
As suggested by Tutz & Gertheiss (2014, 2016), selection, smoothing and/or fusion of ordinally scaled predictors in regression analysis can be carried out using a modified group lasso or a generalized ridge penalty. The penalized log-likelihood to be maximized takes the form $l_p(\beta) = l(\beta) - \lambda J(\beta)$, where $\beta$ is the vector of regression parameters, $\lambda$ is the smoothing parameter and $J(\cdot)$ is the penalty function. ordPens offers various tools for the analysis of ordinally scaled data. The package addresses the aforementioned tasks and offers penalized regression for smoothing, selection and fusion. Specifically, the function ordSmooth() performs smoothing only and incorporates a generalized ridge penalty built from $D_{d,s}$, the matrix generating differences of order $d$, and $\beta_s^T = (\beta_{s1}, \ldots, \beta_{sk_s})$, the parameter vector linked to the $s$th (dummy-coded) predictor with categories $1, \ldots, k_s$. The ordSelect() function performs smoothing and selection by adopting a modified group lasso penalty based on differences. Clustering of categories is done by the function ordFusion(), which uses a fused lasso penalty based on first-order differences. For more information on the original group lasso (for nominal predictors and grouped variables in general), see Meier et al. (2008) and Yuan & Lin (2006). For details on the fused lasso, see Tibshirani et al. (2005). In the case of smoothing only, the package includes auxiliary functions such that mgcv::gam() (Wood, 2008, 2017) can be used for fitting generalized linear and additive models with a first- or second-order ordinal smoothing penalty as well as built-in smoothing parameter selection. The mgcv tools for further statistical inference can be used as well.
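To make the three penalty types more concrete, the following base-R sketch evaluates them for a small set of coefficients. It is not the ordPens implementation: the helper names (diff_matrix, pen_smooth, pen_select, pen_fusion, pen_loglik) are hypothetical, and the penalties are assumed to take the unweighted forms $\sum_s \beta_s^T D_{d,s}^T D_{d,s} \beta_s$ (generalized ridge), $\sum_s \sqrt{\beta_s^T D_{d,s}^T D_{d,s} \beta_s}$ (group lasso on differences) and $\sum_s \sum_{l \geq 2} |\beta_{sl} - \beta_{s,l-1}|$ (fused lasso); the package may use additional weights or a reference-category restriction.

```r
# Matrix generating d-th order differences for k ordered categories.
diff_matrix <- function(k, d = 1) diff(diag(k), differences = d)

# Squared norm of the d-th order differences of one coefficient vector.
sq_diff <- function(b, d = 1) sum((diff_matrix(length(b), d) %*% b)^2)

# Penalties over a list of per-predictor coefficient vectors
# (unweighted forms; an assumption made for illustration).
pen_smooth <- function(beta, d = 1) sum(sapply(beta, sq_diff, d = d))
pen_select <- function(beta, d = 1) sum(sqrt(sapply(beta, sq_diff, d = d)))
pen_fusion <- function(beta) sum(sapply(beta, function(b) sum(abs(diff(b)))))

# Penalized Gaussian log-likelihood l_p(beta) = l(beta) - lambda * J(beta).
pen_loglik <- function(beta, Z, y, lambda, J, sigma = 1) {
  eta <- as.vector(Z %*% unlist(beta))
  sum(dnorm(y, mean = eta, sd = sigma, log = TRUE)) - lambda * J(beta)
}

# Two ordinal predictors with four categories each.
beta <- list(x1 = c(0, 0.5, 1.0, 1.5),  # smooth, monotone effect
             x2 = c(0, 0, 0, 0))        # no effect at all

pen_smooth(beta)  # 0.75
pen_select(beta)  # sqrt(0.75); the predictor without effect contributes 0
pen_fusion(beta)  # 1.5

# Toy data: indicator (dummy) columns for the observed categories.
x1 <- c(1, 2, 3, 4, 4)
x2 <- c(1, 1, 2, 3, 4)
Z  <- cbind(diag(4)[x1, ], diag(4)[x2, ])
y  <- c(0.1, 0.4, 1.2, 1.4, 1.6)
pen_loglik(beta, Z, y, lambda = 2, J = pen_smooth)
```

A predictor whose coefficients are all equal contributes nothing to any of the three penalties, which is why the group-lasso-type and fused-lasso-type penalties can remove a variable or merge adjacent categories.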
Furthermore, testing for differences in means, known as analysis of variance (ANOVA), is provided for ordered factors by the function ordAOV(), which penalizes (squared) differences of adjacent means. Testing for differentially expressed genes when analyzing microarray gene expression data is provided by the function ordGene(). Technical details can be found in Gertheiss (2014) and Sweeney et al. (2016), respectively.
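The estimation idea behind penalizing squared differences of adjacent means can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions: pen_means is a hypothetical helper, the data are simulated, and the test procedure used by ordAOV() is not reproduced here.

```r
# Penalized estimates of the level means of an ordered factor x.
pen_means <- function(x, y, lambda) {
  x <- as.integer(x)                 # levels coded 1, ..., k
  k <- max(x)
  Z <- diag(k)[x, , drop = FALSE]    # n x k indicator matrix of the levels
  D <- diff(diag(k))                 # first-order differences of adjacent means
  # Minimizes sum((y - Z mu)^2) + lambda * sum((D mu)^2).
  as.vector(solve(crossprod(Z) + lambda * crossprod(D), crossprod(Z, y)))
}

set.seed(1)
x <- factor(sample(1:5, 100, replace = TRUE), levels = 1:5, ordered = TRUE)
y <- c(0, 0.2, 0.3, 0.9, 1)[as.integer(x)] + rnorm(100, sd = 0.5)

mu0   <- pen_means(x, y, lambda = 0)    # ordinary level means
mu10  <- pen_means(x, y, lambda = 10)   # smoothed means
muInf <- pen_means(x, y, lambda = 1e8)  # essentially the overall mean
```

With lambda = 0 the estimates coincide with the ordinary group means, while a very large lambda pulls all level means towards the overall mean, which corresponds to the null hypothesis of no differences between levels.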
If, in contrast, dimension reduction is desired in an unsupervised way, principal components analysis can be applied to ordinal data as well. However, such data are usually either treated as numeric, which implies linear relationships between the variables at hand, or analyzed with non-linear PCA, where the obtained coefficients are sometimes hard to interpret. In IBM SPSS Statistics (Version 25.0), for instance, there is an option for smoothing quantifications by means of spline functions. With a small number of knots, however, this limits the type of functions that can be fitted, and a suitable choice may be challenging for the (inexperienced) user. Moreover, as splines are defined on an interval scale whereas ordinal variables can only take a few discrete values, the use of spline functions may be seen as unnecessarily complex for scaling ordinal data. To incorporate the ordinal scale level, the concept of penalization can also be adapted here, as suggested in Hoshiyar et al. (2021). Penalized non-linear principal components analysis for ordinal variables is implemented in the function ordPCA() using a second-order difference penalty. In addition, the function provides performance evaluation and selection of an optimal penalty parameter using k-fold cross-validation. Both non-monotone effects and constraints enforcing monotonicity are supported. Penalized non-linear PCA therefore serves as an intermediate between the two standard approaches described above. The new approach offers both better interpretability and better performance on validation data.
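Why a second-order difference penalty places penalized non-linear PCA between the two standard approaches can be seen from a small base-R sketch. The helper names (pen2, is_monotone) and the example quantifications are hypothetical and are not taken from ordPCA(); the sketch only evaluates the penalty for given category quantifications.

```r
# Second-order difference penalty for the quantifications q_1, ..., q_k
# assigned to the k categories of one ordinal variable.
pen2 <- function(q) sum(diff(q, differences = 2)^2)

# Check whether quantifications are monotonically non-decreasing.
is_monotone <- function(q) all(diff(q) >= 0)

q_linear <- 1:5                      # equally spaced scores (levels treated as numeric)
q_rough  <- c(1, 1.2, 3.5, 3.6, 5)   # unrestricted, wiggly quantifications

pen2(q_linear)       # 0: linear scalings are not penalized
pen2(q_rough)        # about 10.94
is_monotone(q_rough) # TRUE
```

Because equally spaced quantifications receive zero penalty, a very large penalty parameter favours the linear scaling that corresponds to treating the variables as numeric, while a penalty parameter of zero leaves the quantifications unrestricted as in unpenalized non-linear PCA.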
A topic for future research is the analysis of dependencies within a (high-dimensional) set of ordinal variables by graphical models. Another typical approach to ordinal data is motivated by assuming a latent continuous variable linked to the ordinal variable via thresholds. The proportional odds model, which is also motivated as a latent variable approach, in combination with the ordinal penalty could also be of interest for future research. Another interesting direction is found in Huang et al. (2021), who analyze (mixed) ordinal dependencies using a latent Gaussian copula model based on rank correlations. Assuming a latent continuous variable, however, may not always be desirable for the data analyst. The methods implemented in ordPens (up to version 1.0.0) therefore do not rely on a latent variable assumption.

Availability
The R package ordPens is publicly available on CRAN and GitHub, where issues can be opened. ordPens is licensed under the GNU General Public License version 2 (GPL-2). Documentation and examples are contained in the package manual, which can be found on CRAN. To install ordPens, run install.packages("ordPens"). For penalized regression and ordinal ANOVA, see also vignette("ordPens", package = "ordPens"). Penalized non-linear PCA is documented in detail in vignette("ordPCA", package = "ordPens").