HiPart: Hierarchical Divisive Clustering Toolbox

This paper presents the HiPart package, an open-source native Python library that provides efficient and interpretable implementations of divisive hierarchical clustering algorithms. HiPart supports interactive visualizations for the manipulation of the execution steps, allowing direct intervention in the clustering outcome. The package is well suited for big-data applications, as particular focus has been given to the computational efficiency of the implemented clustering methodologies. Its dependencies are either built-in Python packages or well-maintained, stable external packages. The software is provided under the MIT license. The package's source code and documentation can be found at https://github.com/panagiotisanagnostou/HiPart.


Introduction
Data clustering is a problem studied intensively by a variety of research communities. However, high-dimensional data clustering still constitutes a significant challenge, plagued by the curse of dimensionality (Hutzenthaler et al., 2020). Hierarchical divisive algorithms developed in recent years (Tasoulis et al., 2010; Pavlidis et al., 2016; Hofmeyr, 2016; Hofmeyr et al., 2019; Hofmeyr and Pavlidis, 2019) have shown great potential for the particular case of high-dimensional data, as they incorporate dimensionality reduction iteratively within their algorithmic procedure. Additionally, they are unique in providing a hierarchical format of the clustering result at low computational cost, in contrast to the commonly used but computationally demanding agglomerative clustering methods.
Although the discovery of a hierarchical format is crucial in many fields, such as bioinformatics (Luo et al., 2003; Modena et al., 2014), to the best of our knowledge, this package is the first native Python implementation of divisive hierarchical clustering algorithms. We particularly focus on the "Principal Direction Divisive Partitioning (PDDP)" algorithm (Boley, 1998) for its potential to effectively tackle the curse of dimensionality and its excellent time performance (Tasoulis et al., 2010).
Simultaneously, we provide implementations of a complete set of hierarchical divisive clustering algorithms with a similar basis. These are dePDDP (Tasoulis et al., 2010), iPDDP (Tasoulis et al., 2010), kM-PDDP (Zeimpekis and Gallopoulos, 2008), and bisecting k-Means (BKM) (Savaresi and Boley, 2001). We also provide additional features, not included in the original developments of the aforementioned methodologies, that make them appropriate for the discovery of arbitrarily shaped or non-linearly separable clusters. In detail, we incorporate kernel Principal Component Analysis (kPCA) (Scholkopf et al., 1999) and Independent Component Analysis (ICA) (Hyvärinen and Oja, 2000; Tharwat, 2020) for the iterative dimensionality-reduction steps.
As a result, the package provides a fully parameterized set of algorithms applicable to a diverse range of settings, for example, non-linearly separable clusters, automated identification of the number of clusters, and outlier control.
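To make the core idea of the PDDP family concrete, the following is a minimal sketch of a single PDDP splitting step, written with NumPy only. It is an illustration of the technique described by Boley (1998), not HiPart's actual implementation: the data are projected onto the first principal direction and split at the sign of the centred projection.

```python
import numpy as np

def pddp_split(X):
    """One PDDP step (illustrative sketch, not HiPart code):
    project onto the first principal direction and split the
    samples at zero on the centred projection axis."""
    Xc = X - X.mean(axis=0)                       # centre the data
    # First right singular vector of the centred data matrix
    # is the first principal direction.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[0]                             # 1-D projections
    return (proj > 0).astype(int)                 # two sub-clusters

# Two well-separated Gaussian blobs in 10 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 1, (50, 10)),
               rng.normal(5, 1, (50, 10))])
labels = pddp_split(X)
```

The full divisive algorithms apply this step recursively, each time selecting which leaf to split next according to their respective criteria (e.g. density around the split point for dePDDP).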

Software Description
The HiPart (Hierarchical Partitioning) package is divided into three major sections:

Method Implementation
The package employs an object-oriented approach for the implementation of the algorithms, similar to that of Bach et al. (2022), while incorporating design similarities with the scikit-learn library (Pedregosa et al., 2011). That is, each algorithm is executed by a class, whose parameters and attributes are the algorithm's hyper-parameters and results, respectively.
To execute an algorithm, the user calls either the predict or the fit_predict method of its execution class. The algorithm's parameterization is applied through the constructor of the respective class.
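The call pattern described above can be illustrated with a self-contained toy estimator. The class name, hyper-parameter, and splitting rule below are purely hypothetical stand-ins for HiPart's classes, chosen only to show the scikit-learn-style interface: hyper-parameters go to the constructor, and results come from fit_predict.

```python
import numpy as np

class ToySplitEstimator:
    """Hypothetical stand-in for a HiPart-style execution class
    (names and parameters are illustrative, not HiPart's API)."""

    def __init__(self, max_clusters_number=2):
        # Hyper-parameters are set in the constructor.
        self.max_clusters_number = max_clusters_number

    def fit_predict(self, X):
        # A single mean-threshold split on the first feature, purely
        # to demonstrate the call pattern; the real algorithms split
        # the data recursively along projected directions.
        self.labels_ = (X[:, 0] > X[:, 0].mean()).astype(int)
        return self.labels_

X = np.array([[0.0, 1.0], [0.1, 0.9], [5.0, 5.1], [5.2, 4.9]])
labels = ToySplitEstimator(max_clusters_number=2).fit_predict(X)
```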

Static Visualization
Two static visualization methods are included. The first is a two-dimensional representation of all the data splits generated by each algorithm during the hierarchical procedure. The goal is to give the user insight into each node of the clustering tree and, subsequently, each step of the algorithm's execution.
The second visualization method is a dendrogram that represents the splits of all the divisive algorithms. The dendrogram figure is created with the SciPy package and is fully parameterized, as described in that library's documentation.
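Since the dendrogram is rendered through SciPy, its behaviour follows scipy.cluster.hierarchy.dendrogram. The sketch below shows that underlying SciPy interface on a toy linkage matrix; it is not HiPart code, and the linkage here is built agglomeratively only to obtain a valid matrix, whereas HiPart derives its matrix from the divisive tree.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Two tight, well-separated groups of three points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (3, 2)),
               rng.normal(4, 0.1, (3, 2))])
Z = linkage(X, method="ward")      # toy linkage matrix

# no_plot=True returns the layout (leaf order, coordinates)
# without drawing a figure, which is convenient for inspection.
info = dendrogram(Z, no_plot=True)
leaf_order = [int(i) for i in info["ivl"]]
```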

Interactive Visualization
In the interactive mode, we provide the possibility of stepwise manipulation of the algorithms. The user can choose a particular step (node of the tree) and manipulate the split-point on top of a two-dimensional visualization, instantly altering the clustering result. Each manipulation resets the algorithm's execution from that step onwards, resulting in a restructuring of the sub-tree rooted at the manipulated node.
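What moving a split-point means for a single node can be sketched in a few lines. The function below is illustrative only (not HiPart's implementation): a node's samples are ordered by their one-dimensional projection, and dragging the split-point simply changes which side of the threshold each sample falls on; the descendants of that node must then be recomputed, as described above.

```python
import numpy as np

def split_at(projections, split_point):
    """Reassign one node's samples for a user-chosen split point
    on the 1-D projection axis (illustrative sketch)."""
    return (projections >= split_point).astype(int)

proj = np.array([-2.0, -1.5, -0.2, 0.3, 1.8, 2.4])
default = split_at(proj, 0.0)   # the algorithm's own split point
moved = split_at(proj, 1.0)     # user drags the split point right
```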

Development Notes
For the development of the package, we complied with the PEP8 style standard and enforced it with the flake8 command-line utility. To ensure code quality, we applied the unittest module across the entire source code. In addition, platform compatibility has been assured through extensive testing, and the package in its entirety uses only well-established or built-in Python packages. The package has been released as open-source software under the MIT license. For information regarding potential contributions, or for the submission of an issue or a request, the package is hosted as a repository on GitHub.

Experiments and Comparisons
In this section, we provide clustering results with respect to execution speed and clustering performance for the provided implementations. For direct comparison, we employ a series of well-established clustering algorithms: k-Means (Likas et al., 2003), Agglomerative clustering (AGG) (Ackermann et al., 2014), and OPTICS (Ankerst et al., 1999) from the scikit-learn (Pedregosa et al., 2011) Python library, and the fuzzy c-means (FCM) algorithm (Bezdek et al., 1984) from the fuzzy-c-means (Dias, 2019) Python package. Clustering performance is evaluated using the Normalized Mutual Information (NMI) score (Yang et al., 2016).
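NMI is a standard choice here because it is invariant to the arbitrary cluster identifiers that clustering algorithms assign. A short illustration with scikit-learn's implementation:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

truth = np.array([0, 0, 0, 1, 1, 1])
# A perfect clustering whose cluster ids happen to be swapped
# still scores 1.0, since NMI ignores label permutations.
perfect_but_renamed = np.array([1, 1, 1, 0, 0, 0])
# An uninformative alternating assignment scores close to 0.
random_like = np.array([0, 1, 0, 1, 0, 1])

print(normalized_mutual_info_score(truth, perfect_but_renamed))  # 1.0
print(normalized_mutual_info_score(truth, random_like))
```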
Four widely used data sets from the field of bioinformatics are employed, along with two popular benchmark data sets for text and image clustering, respectively:
• the Deng (Deng et al., 2014),
• the TCGA Pan-Cancer (Cancer),
• the Baron (Baron et al., 2016),
• the Chen (Chen et al., 2017),
• the BBC (Greene and Cunningham, 2006), and
• the USPS (Hull, 1994).
All experiments took place on a server computer running Linux, kernel version 5.11.0, with an Intel Core i7-10700K CPU @ 3.80GHz and four 32GB DDR4 RAM DIMMs at 2133MHz. Default parameters were used for the execution of all algorithms, and the actual number of clusters was provided to the algorithms as a parameter when required.
In Table 1 we present the mean performance of all methods with respect to execution time (in seconds) and NMI across 100 experiments. We observe that the HiPart implementations perform exceptionally well in terms of execution time, while remaining competitive with respect to clustering performance.

Conclusions and Future Work
We present a highly time-efficient clustering package with a suite of tools capable of addressing problems in high-dimensional data clustering. In addition, the newly developed visualization tools enhance the understanding and identification of the underlying clustering structure of the data.
We plan to continuously expand the HiPart package in the future by adding more hierarchical algorithms and by providing even more options for dimensionality reduction, such as recent projection-pursuit methodologies (Pavlidis et al., 2016; Hofmeyr, 2016; Hofmeyr et al., 2019; Hofmeyr and Pavlidis, 2019). Our final aim is to establish the gold standard for hierarchical divisive clustering.

Figure 1 :
Figure 1: Dendrogram for the Cancer data set, produced with the dePDDP algorithm and the dendrogram visualization module of the HiPart library. The line below the tree indicates the colour of the original cluster to which each sample belongs.

Table 1 :
Clustering results with respect to execution time and clustering performance.