HARDy: Handling Arbitrary Recognition of Data in Python

License: Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

HARDy is a Python-based package that helps evaluate differences in data through feature engineering coupled with kernel methods. The package provides an extension to machine learning by adding layers of feature transformation and representation. The workflow of the package is as follows:

• Configuration: sets attributes for user-defined transformations and for machine learning hyperparameters or a hyperparameter search space
• Handling: imports pre-labelled data from .csv files and loads it into the catalogue; the data is later split into training and testing sets
• Arbitrage: applies user-defined numerical and visual transformations to all the loaded data
• Recognition: machine learning module that applies a user-defined hyperparameter search space for training and evaluation of models
• Data-Reporting: imports the results of the machine learning models and reports them as dataframes and plots
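As a minimal sketch of the configuration step, a transformation trial and a hyperparameter search space can be pictured as a plain dictionary. All key names below are illustrative assumptions and do not reflect HARDy's actual configuration schema; HARDy reads equivalent settings from user-supplied configuration files.

```python
from itertools import product

# Illustrative configuration: one transformation trial plus a
# hyperparameter search space (lists of candidate values, not fixed values).
config = {
    "transformations": [
        {"name": "log_x", "apply_to": "x"},   # numerical transform on the x data
        {"name": "log_y", "apply_to": "y"},   # numerical transform on the y data
    ],
    "plot": {"kind": "rgb", "size": [100, 100]},  # visual representation choice
    "hyperparameters": {
        "filters": [8, 16, 32],
        "kernel_size": [3, 5],
        "learning_rate": [1e-2, 1e-3],
    },
}

def expand_search_space(hp):
    """Enumerate every hyperparameter combination from lists of candidates."""
    keys = sorted(hp)
    return [dict(zip(keys, combo)) for combo in product(*(hp[k] for k in keys))]

combos = expand_search_space(config["hyperparameters"])
print(len(combos))  # 3 filters x 2 kernel sizes x 2 learning rates = 12
```

Expressing the search space declaratively, rather than in code, is what lets the same training script explore many model variants without modification.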

Statement of Need
High Throughput Experimentation (HTE) and High Throughput Testing (HTT) have exponentially increased the volume of experimental data available to scientists. One of the major bottlenecks in their implementation is data analysis. The need for autonomous binning and classification has seen an increase in the employment of machine learning approaches in the discovery of catalysts, energy materials and process parameters for design of experiment (Becker et al., 2019; Williams et al., 2019). However, these solutions rely on specific sets of hyperparameters for their machine learning models to achieve the desired purpose. Furthermore, numerical data from experimental characterization of materials carries diversity in both features and magnitude. These features are traditionally extracted using deterministic models based on empirical relationships between variables of the process under investigation. As an example, X-ray diffraction (XRD) data is easier to characterize in linear form, whereas small angle X-ray scattering data requires transformation of the axes to a log-log scale.
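The axis transformation mentioned above is itself a simple numerical feature transform. The helper below is a generic sketch, not HARDy code: it maps a power-law decay, of the kind common in scattering data, onto log-log axes, where it becomes a straight line that is far easier to characterize.

```python
import math

def loglog_transform(x, y):
    """Map (x, y) data onto log-log axes, as is customary for
    small-angle scattering spectra; all values must be positive."""
    return [math.log10(v) for v in x], [math.log10(v) for v in y]

# A power law y = x**-4 (a Porod-like decay) becomes a straight
# line of slope -4 on log-log axes:
x = [1e-3, 1e-2, 1e-1, 1.0]
y = [v**-4 for v in x]
lx, ly = loglog_transform(x, y)
slope = (ly[-1] - ly[0]) / (lx[-1] - lx[0])
print(slope)  # -4.0
```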
One of the most widely applied strategies to enhance the performance of machine learning models is Automated Machine Learning (AutoML) for CASH (Combined Algorithm Selection and Hyperparameter optimization) (Hutter et al., 2019). However, these packages are limited to hyperparameter tuning, and the data features remain untouched. To improve the effectiveness of machine learning models, some of the popular feature engineering strategies used for simple numerical data include binning, binarization, normalization, Box-Cox transformations and Quantile Sketch Array (QSA) (Nargesian et al., 2017; Zheng & Casari, 2018). Moreover, Deep Feature Synthesis, in which features are generated from relational databases by performing multi-layer mathematical transformation operations, has also shown promising results (Kanter & Veeramachaneni, 2015).
HARDy presents an infrastructure which aids in the identification of the best combination of numerical and visual transformations to improve data classification through Convolutional Neural Networks (CNN). HARDy exploits the difference between human-readable images of experimental data (i.e., Cartesian representations) and computer-readable plots, which maximize the data density presented to an algorithm and reduce superfluous information. HARDy uses configuration files, fed to the open-source package KerasTuner, removing the need for the user to manually generate unique parameter combinations for each neural network model to be investigated.
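The role of the tuner can be sketched independently of any deep-learning framework: given a search space from the configuration file, a search method samples hyperparameter combinations and keeps the best-scoring candidate. The snippet below is a framework-free stand-in for what a tuner such as KerasTuner's RandomSearch does; the objective function is a toy placeholder for validation accuracy.

```python
import random

def random_search(space, evaluate, trials=5, seed=0):
    """Minimal stand-in for a random-search tuner: sample hyperparameter
    combinations from `space` and keep the best-scoring one."""
    rng = random.Random(seed)
    best_hp, best_score = None, float("-inf")
    for _ in range(trials):
        hp = {name: rng.choice(choices) for name, choices in space.items()}
        score = evaluate(hp)
        if score > best_score:
            best_hp, best_score = hp, score
    return best_hp, best_score

# Toy objective standing in for the validation accuracy of a trained CNN.
space = {"filters": [8, 16, 32], "kernel_size": [3, 5]}
best, score = random_search(
    space, lambda hp: hp["filters"] / 32 - 0.01 * hp["kernel_size"]
)
print(best)
```

In HARDy the sampling, model construction and scoring are delegated to KerasTuner; the configuration file only has to declare the space and the search method.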

Description and Use Case
HARDy is a modularly structured Python package which classifies data using 2D convolutional neural networks. A schematic for the package can be found in figure 1. The package was tested on a set of simulated Small Angle Scattering (SAS) data to be classified into four different particle models: spherical, ellipsoidal, cylindrical and core-shell spherical. A total of ten thousand files were generated for each model. The data was generated using sasmodels. The geometrical and physical parameters used to obtain each spectrum were taken from a published work discussing a similar classification task (Archibald et al., 2020).
The name of each SAS model was used as the label for the data, allowing for further validation of the test set results. These models were selected because they present similar parameters and data features, which at times makes it challenging to distinguish between them. First, the pre-labelled data was loaded. A subset of the files, three thousand in total, was identified as the testing set. All the ML models initialized in the same code run were validated using the same testing set. A user-provided list of transformations, inputted through a configuration file, was then applied to the data. Different trials can be specified, so that multiple sets of transformations can be investigated. Both Cartesian and RGB plot representations were compared. The latter visualization option was obtained by encoding the data into the pixel values of each channel composing a color image, for a total of six channels available (i.e., 3 RGB channels in horizontal/vertical orthogonal directions).
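The RGB encoding can be illustrated schematically as follows. The normalization and channel layout below are simplified assumptions for illustration, not HARDy's exact implementation: one data series is scaled to 0-255 pixel intensities and tiled across a single channel, one column per data point.

```python
def encode_channel(values, width):
    """Normalize a data series to 0-255 pixel intensities and tile it
    across one image channel, one column per data point (values run
    horizontally; a transposed copy would run vertically instead)."""
    lo, hi = min(values), max(values)
    scaled = [int(255 * (v - lo) / (hi - lo)) for v in values]
    # Every row of the channel repeats the series -> a striped image.
    return [scaled[:] for _ in range(width)]

# Two series (e.g. the transformed x and y data) would occupy two of
# the six available slots (3 RGB channels x 2 orthogonal directions).
x = [0.1, 0.2, 0.4, 0.8]
red = encode_channel(x, width=4)
print(red[0])  # [0, 36, 109, 255]
```

Because every pixel now carries data rather than axes, ticks or whitespace, the image presented to the CNN is far denser than a conventional Cartesian plot.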
The data was then fed into a convolutional neural network, whose hyperparameters and structure were defined using another configuration file. Alternatively, it is also possible to train multiple classifiers for a single transformation trial through the use of a tuner, by instead providing a hyperparameter space and a search method. The classification results, as well as the best performing trained model, were saved for each transformation run. The package also allows the user to visually compare, through parallel coordinates plots (see documentation), the performance of each transformation. Figure 2 shows a summary of a few runs comparing the two visualization strategies and their best performing model accuracies.
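The reporting step can be pictured as collecting, for each transformation trial, the best model's metrics into a single table that a parallel coordinates plot can then draw from. The record structure below is an illustrative assumption, not HARDy's actual output format.

```python
# Hypothetical per-trial results, as a reporting module might collect them.
results = [
    {"trial": "log_x-log_y", "plot": "rgb", "accuracy": 0.96},
    {"trial": "log_x-log_y", "plot": "cartesian", "accuracy": 0.71},
    {"trial": "raw", "plot": "rgb", "accuracy": 0.55},
]

def best_by(records, key="accuracy"):
    """Return the record with the highest value for `key`."""
    return max(records, key=lambda r: r[key])

best = best_by(results)
print(best["trial"], best["plot"])  # log_x-log_y rgb
```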
Comprehensive results for all transformations tested are available in the documentation. It can be noticed that data representation using Cartesian coordinate plots yielded a higher number of instances in which the accuracy of the trained machine learning model was ~25%. This value corresponds to the machine learning model's inability to recognize differences in a four-class classification task. On the other hand, the RGB plots show, on average, higher accuracy for the same combinations of numerical transformations. To further validate the results, mathematical fitting was performed on a test set using the sasmodels package. The fitting was based on the probabilities determined by the ML model for each label. In scenarios where the output probability was below 70%, the data was also fitted using the second most probable SAS model. The average chi-square parameter of the fitted data was determined to be 7.5. Approximately 11% of the data had a probability lower than 70%. In all cases, as seen in figure 3, if the neural network was not able to correctly label the data with the highest probability label, the second highest probability label was the correct one.
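The validation logic described above, fitting with the top-probability model and additionally with the runner-up whenever confidence falls below 70%, can be sketched as follows; the model names are placeholders and the actual fitting call to sasmodels is omitted.

```python
def models_to_fit(probabilities, threshold=0.70):
    """Given per-label classification probabilities, return the labels to
    attempt fitting with: the most probable model, plus the second most
    probable one when the top confidence is below `threshold`."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    selected = [ranked[0]]
    if probabilities[ranked[0]] < threshold:
        selected.append(ranked[1])
    return selected

# Confident prediction: only the top model is fitted.
print(models_to_fit({"sphere": 0.91, "ellipsoid": 0.05,
                     "cylinder": 0.03, "core_shell": 0.01}))
# -> ['sphere']

# Ambiguous prediction: the runner-up is fitted as well.
print(models_to_fit({"sphere": 0.55, "ellipsoid": 0.40,
                     "cylinder": 0.04, "core_shell": 0.01}))
# -> ['sphere', 'ellipsoid']
```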
In conclusion, HARDy can significantly improve data classification, so that automatic data fitting and modeling can be executed without human intervention and without compromising reliability. We also note that data representation for computer-classification tasks may not follow human intuition and/or standard conventions. HARDy serves a key role in the optimization of visual data representations for CNN classification tasks. Finally, the flexibility of HARDy allows the task to be deployed on a supercomputing cluster, easing the limitations imposed by the high computational power required to run these ML algorithms. All configuration files and scripts used to run the example presented in this paper can be found in the package documentation.