AuDoLab: Automatic document labelling and classification for extremely unbalanced data

Summary
AuDoLab provides a novel approach to one-class document classification for heavily imbalanced datasets, even if labelled training data is not available. The package enables the user to create specific out-of-domain training data and to classify a heavily underrepresented target class in a document dataset, using a recently developed integration of Web Scraping, Latent Dirichlet Allocation Topic Modelling and One-class Support Vector Machines (Thielmann, Weisser, Krenz, & Säfken, 2021). AuDoLab can achieve high-quality results even on highly specific classification problems without the time- and cost-intensive labelling of training documents by humans. Hence, AuDoLab has a broad range of applications in scientific research and in business. A few potential use cases illustrate this range. For example, AuDoLab could be used to identify emails on very specific topics such as fraud or money laundering, which might have an extremely low prevalence. Similarly, AuDoLab could be used in the medical field to classify medical documents concerned with very specific topics such as heart attacks or dental problems. Furthermore, AuDoLab may be used to identify legal documents on very specific topics such as machine learning. Note that the only limiting factor is the availability of out-of-domain training data, which is generated via Web Scraping from IEEEXplore (IEEE Xplore, 2020), ArXiv or PubMed. Given that a broad range of training documents can be obtained from these websites, AuDoLab has a correspondingly broad range of applications. The following section provides an overview of AuDoLab. AuDoLab can be installed conveniently via pip. A detailed description of the package and its installation can be found in the package's repository or on the documentation website.

Statement of need
Unsupervised document classification is mainly performed to gain insight into the underlying topics of large text corpora. In this process, documents covering highly underrepresented topics have only a minor impact on the algorithm's topic definitions. As a result, underrepresented topics can sometimes be "overlooked" and documents are assigned topic prevalences that do not reflect the underlying content. Thus, labelling underrepresented topics in large text corpora is often done manually and can therefore be very labour-intensive and time-consuming. AuDoLab enables the user to tackle this problem and perform unsupervised one-class document classification for heavily underrepresented document classes. It leverages the results of one-class document classification using One-class Support Vector Machines (SVM) (Manevitz & Yousef, 2001; Schölkopf et al., 2001) and extends them to the use case of severely imbalanced datasets. This adaptation and extension is achieved by implementing a multi-level classification rule as visualised in the graph below.
The first part of the package allows the user to scrape training documents (scientific papers), ideally covering the target topic in which the user is interested, from IEEEXplore (IEEE Xplore, 2020), ArXiv or PubMed. The user can search for multiple search terms, specify an individual search query and, in the case of IEEEXplore, scrape additional information such as the author names or the number of citations. Thus, an individually labelled (e.g., via author keywords) training data set is created. Through the integration of pre-labelled out-of-domain training data, the problem of the heavily underrepresented target class can be circumvented, as sufficiently large training corpora can be generated automatically. Subsequently, the text data is preprocessed for the classification part. The preprocessing includes common Natural Language Processing (NLP) techniques such as stopword removal and lemmatization.
Documents are represented by their term frequency-inverse document frequency (tf-idf) vectors. The tf-idf scores are computed on a joint corpus consisting of the web-scraped out-of-domain training data and the target text data.
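The joint-corpus tf-idf step can be sketched as follows. This is a minimal illustration using scikit-learn's TfidfVectorizer, not AuDoLab's own interface; the two toy document lists and all variable names are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins (assumptions): scraped out-of-domain papers and target documents.
train_docs = [
    "neural network training with gradient descent",
    "support vector machine for text classification",
    "deep learning model for image recognition",
]
target_docs = [
    "patent for a machine learning based classification device",
    "patent for a mechanical gear assembly",
]

# Fit vocabulary and idf weights on the joint corpus so that both
# document sets are mapped into one shared feature space.
vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(train_docs + target_docs)

X_train = vectorizer.transform(train_docs)    # one row per training document
X_target = vectorizer.transform(target_docs)  # one row per target document
```

Fitting on the joint corpus means that terms occurring only in the target documents still receive a column and an idf weight, so the two matrices have identical columns.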
The second and main part of the classification rule lies in the training of the one-class SVM (Schölkopf et al., 2001). Only the out-of-domain training data is used as the training corpus. By adjusting hyperparameters, the user can create a strict or relaxed classification rule that reflects the user's belief about the prevalence of the target class inside the target data set and the quality of the scraped out-of-domain training data. The last part of the classification rule enables the user to check the classifier's results with the help of Latent Dirichlet Allocation (LDA) topic models (Blei et al., 2003) and, e.g., word clouds. Additionally, the user can generate interactive plots depicting the topics identified during the LDA topic modelling (Sievert & Shirley, 2014).
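The training step described above can be sketched with scikit-learn's OneClassSVM. This is an illustrative sketch, not AuDoLab's API: the toy documents, the helper classify and the two nu values are assumptions. In scikit-learn, nu is an upper bound on the fraction of training documents treated as outliers, so a larger nu tends to give a tighter, stricter rule.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Toy stand-ins (assumptions): scraped out-of-domain papers and target documents.
train_docs = [
    "machine learning model for text classification",
    "support vector machine learning for classification",
    "deep learning model training with gradient descent",
    "neural network learning for image classification",
]
target_docs = [
    "patent for a machine learning classification device",
    "patent for a mechanical gear assembly",
    "patent for a bicycle brake lever",
]

# Shared tf-idf feature space fitted on the joint corpus.
vectorizer = TfidfVectorizer(stop_words="english").fit(train_docs + target_docs)
X_train = vectorizer.transform(train_docs)
X_target = vectorizer.transform(target_docs)


def classify(nu):
    """Train on the out-of-domain data only, then label the target documents.

    Returns +1 for documents inside the learned region and -1 otherwise.
    """
    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
    clf.fit(X_train)
    return clf.predict(X_target)


strict = classify(nu=0.5)    # hypothetical "strict" setting
relaxed = classify(nu=0.05)  # hypothetical "relaxed" setting
print("strict: ", strict)
print("relaxed:", relaxed)
```

On a corpus this small the two settings may well produce the same labels; the point is only that the target documents never enter the fit call.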
The second step can be repeated, depending on the user's perceived quality of the classification results.
AuDoLab: Automatic document labelling and classification for extremely unbalanced data. Journal of Open Source Software, 6(66), 3719. https://doi.org/10.21105/joss.03719

Comparison with existing tools
At the moment, no Python package with functionality comparable to AuDoLab is available, since AuDoLab is based on a novel and recently published classification procedure (Thielmann, Weisser, Krenz, & Säfken, 2021). AuDoLab integrates Web Scraping, Topic Modelling and One-class Classification, for each of which individual packages are available. Details on the statistical methodology can be found in Thielmann, Weisser, Krenz, & Säfken (2021). An application of the methodology on a data set of patent data can be found in . For Topic Modelling, available packages include Gensim (Řehůřek & Sojka, 2010), which implements the LDA algorithm, and TTLocVis (Kant et al., 2020) for short and sparse text. Visual representations of the topics can be produced with LDAvis (Sievert & Shirley, 2014). One-class SVM classification is available in Scikit-learn (Pedregosa et al., 2011). Further research could also explore Deep Learning algorithms.