CliquePercolation: An R Package for conducting and visualizing results of the clique percolation network community detection algorithm

Modeling complex phenomena as networks is one of the most versatile fields of research, if not the most versatile (Barabási, 2011). Indeed, many interconnected entities can be represented as networks, in which entities are called nodes and their connections are called edges. For instance, networks can represent friendships between people, hyperlinks between web pages, or correlations between questionnaire items. One structural characteristic of networks that is frequently investigated across the sciences is their community structure (Fortunato, 2010). Communities are strongly connected subgraphs in the network, such as groups of friends, thematic fields, or latent factors. Most community detection algorithms assign each node to exactly one community. However, nodes are often shared by multiple communities, e.g., when a person is part of multiple groups of friends, web pages belong to different thematic fields, or items load on multiple factors. The most popular community detection algorithm aimed at identifying such overlapping communities is the clique percolation algorithm (Farkas et al., 2007; Palla et al., 2005).


Summary and Statement of Need
So far, the clique percolation algorithm has not been implemented in an R package (R Core Team, 2020). The primary software for running the algorithm is the standalone program CFinder, written in C++ and Java (Adamcsek et al., 2006). However, CFinder cannot construct networks from data or visualize the solutions of the algorithm, so it has to be combined with other software such as R. Handling multiple programs impedes a smooth workflow. Besides CFinder, an R function for running one variant of the clique percolation algorithm is available in a GitHub repository. However, it is not implemented in a package and lacks functions for optimizing the parameters of the algorithm and for plotting its results. CliquePercolation overcomes these limitations: it provides functions for optimizing the parameters of the algorithm, running the algorithm, and plotting the results.

A minimal example
The structure of a network can be captured in a matrix. An undirected network of n nodes translates into a symmetric n-by-n matrix. Each element a_ij takes the value 0 if there is no edge between nodes i and j. If there is an edge, a_ij takes the value 1 in an unweighted network and any non-zero value in a weighted network. The R package qgraph (Epskamp et al., 2012) can visualize such networks. For instance, Figure 1 depicts a weighted network with eight nodes, a to h.
The clique percolation algorithm proceeds in two steps. First, it identifies k-cliques in the network, i.e., fully connected subgraphs with k nodes, and retains those whose geometric mean of edge weights exceeds the intensity threshold I. Second, communities are defined as sets of adjacent k-cliques, i.e., k-cliques that share k − 1 nodes. As a result, a node can be shared by several communities or belong to none of them (an isolated node).
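The two steps can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's implementation: the seven-node network below is hypothetical (it is not the network from Figure 1), all weights are assumed to be positive, and the helper names `k_cliques` and `percolate` are invented for this sketch. Enumerating all node subsets with `combn` is feasible only for very small networks.

```r
# Hypothetical toy network (NOT the Figure 1 network): 7 nodes, positive weights.
nodes <- letters[1:7]
W <- matrix(0, 7, 7, dimnames = list(nodes, nodes))
edges <- rbind(c(1, 2, 0.3), c(1, 3, 0.3), c(2, 3, 0.3),
               c(2, 4, 0.2), c(3, 4, 0.2),
               c(4, 5, 0.4), c(4, 6, 0.4), c(5, 6, 0.4),
               c(6, 7, 0.1))
for (r in seq_len(nrow(edges))) {
  i <- edges[r, 1]; j <- edges[r, 2]
  W[i, j] <- W[j, i] <- edges[r, 3]   # keep the matrix symmetric
}

# Step 1: keep every fully connected set of k nodes whose geometric
# mean edge weight exceeds the intensity threshold I.
k_cliques <- function(W, k, I) {
  candidates <- combn(nrow(W), k, simplify = FALSE)
  Filter(function(v) {
    sub <- W[v, v]
    w <- sub[upper.tri(sub)]
    all(w != 0) && exp(mean(log(w))) > I
  }, candidates)
}

# Step 2: merge adjacent k-cliques (sharing k - 1 nodes) into communities.
percolate <- function(cliques, k) {
  n <- length(cliques)
  if (n == 0) return(list())
  comp <- seq_len(n)                   # one component label per clique
  for (a in seq_len(n)) for (b in seq_len(n)) {
    if (a < b && length(intersect(cliques[[a]], cliques[[b]])) == k - 1) {
      comp[comp == comp[b]] <- comp[a] # join the two components
    }
  }
  unname(lapply(split(cliques, comp),
                function(cs) sort(unique(unlist(cs)))))
}

communities <- percolate(k_cliques(W, k = 3, I = 0.15), k = 3)
members <- unlist(communities)
shared <- unique(members[duplicated(members)])      # nodes in several communities
isolated <- setdiff(seq_len(nrow(W)), members)      # nodes in no community
print(lapply(communities, function(v) nodes[v]))
```

In this toy network, the triangles a-b-c and b-c-d share two nodes and therefore merge into one community, whereas d-e-f forms a second community that overlaps with the first only in node d.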
The package CliquePercolation facilitates executing these steps. First, it helps identify optimal values for k and I. For very small networks (as in Figure 1), the entropy of the community partition should be maximized (treating isolated nodes as a separate community):
$$\text{Entropy} = -\sum_{i=1}^{N} p_i \log_2 p_i$$
where N is the number of communities and p_i is the probability of being in community i. Entropy is maximal when the resulting communities are equally sized with a small number of isolated nodes. A permutation test, which repeatedly shuffles the edges in the network at random and recalculates entropy, can indicate which entropy values are higher than expected by chance.
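As a rough illustration of the entropy criterion, the following base R sketch computes the Shannon entropy of a community partition. It rests on assumptions that are not taken from the package: a base-2 logarithm, p_i taken as each community's share of all memberships, and isolated nodes pooled into one extra community. The function names `partition_entropy` and `shuffle_edges` are hypothetical, and the package's own permutation scheme may differ from the simple weight shuffle shown here.

```r
# Entropy of a partition given as a list of node-index vectors.
# Assumption: isolated nodes are pooled into one additional community.
partition_entropy <- function(communities, n_nodes) {
  isolated <- setdiff(seq_len(n_nodes), unlist(communities))
  sizes <- c(lengths(communities),
             if (length(isolated) > 0) length(isolated))
  p <- sizes / sum(sizes)
  -sum(p * log2(p))
}

# Two equally sized communities and no isolated nodes: 1 bit.
partition_entropy(list(1:4, 5:8), n_nodes = 8)

# Hypothetical partition with a shared node (4) and an isolated node (7).
partition_entropy(list(1:4, 4:6), n_nodes = 7)

# One permutation step: shuffle the weights among node pairs while
# keeping the matrix symmetric. Repeating this and recomputing the
# entropy each time yields a chance distribution for comparison.
shuffle_edges <- function(W) {
  ut <- upper.tri(W)
  W[ut] <- sample(W[ut])
  W[lower.tri(W)] <- t(W)[lower.tri(W)]
  W
}
```

An entropy value observed for a given k and I would then be compared against the values obtained from many shuffled networks.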
In CliquePercolation, the cpThreshold function calculates entropy for a range of k and I values. In this example, the highest entropy results for k = 3 and I = 0.09; these values can then be passed to the cpAlgorithm function to run the clique percolation algorithm for weighted networks.
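A call sequence along these lines might look as follows. This is a hedged sketch rather than the paper's original code: the weight matrix is a hypothetical toy network (not the network from Figure 1), the argument names are written as the author of this sketch understands the package documentation and should be checked against the installed version, and the package calls run only if CliquePercolation is available.

```r
# Hypothetical toy network (NOT the Figure 1 network).
nodes <- letters[1:7]
W <- matrix(0, 7, 7, dimnames = list(nodes, nodes))
edges <- rbind(c(1, 2, 0.3), c(1, 3, 0.3), c(2, 3, 0.3),
               c(2, 4, 0.2), c(3, 4, 0.2),
               c(4, 5, 0.4), c(4, 6, 0.4), c(5, 6, 0.4),
               c(6, 7, 0.1))
for (r in seq_len(nrow(edges))) {
  i <- edges[r, 1]; j <- edges[r, 2]
  W[i, j] <- W[j, i] <- edges[r, 3]
}

if (requireNamespace("CliquePercolation", quietly = TRUE)) {
  library(CliquePercolation)

  # Entropy for a grid of k and I values (argument names assumed
  # from the package documentation).
  grid <- cpThreshold(W, method = "weighted",
                      k.range = 3,
                      I.range = seq(0.30, 0.05, by = -0.05),
                      threshold = "entropy")
  print(grid)

  # Run the algorithm with the k and I values reported in the text.
  fit <- cpAlgorithm(W, k = 3, method = "weighted", I = 0.09)
  print(fit)
}
```

In practice, k and I would be chosen from the row of the cpThreshold output with the highest entropy rather than fixed in advance.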
The function cpColoredGraph can visualize the results. For instance, with the default color scheme, all nodes that belong to the same community get the same color, shared nodes are split into multiple parts with one color for each community they belong to, and isolated nodes are white (see Figure 2).
Beyond this minimal example, the CliquePercolation package provides more functionality for applying the clique percolation algorithm to different kinds of networks and for plotting the results. The full range of options is described in the package vignette, which is available by running vignette("CliquePercolation"). Moreover, a detailed blog post used the package in research on a psychological disorder network, and a recent publication applied the package in research on emotions (Lange & Zickfeld, 2021).