Inscriptis -- A Python-based HTML to text conversion library optimized for knowledge extraction from the Web

Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium. In contrast to related software packages, Inscriptis (i) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers; and (ii) supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes. These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document.


Summary
Inscriptis provides a library, command line client and Web service for converting HTML to plain text.
Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium [9]. In contrast to existing software packages such as HTML2text [23], jusText [2] and Lynx [5], Inscriptis

1. provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements. Inscriptis excels in terms of conversion quality, since it correctly converts complex HTML constructs such as nested tables and also interprets a subset of HTML (e.g., align, valign) and CSS (e.g., display, white-space, margin-top, vertical-align) attributes that determine the text alignment.
2. supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document.
These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document, if annotation support has been enabled.

Statement of need
Research in a growing number of scientific disciplines relies upon Web content. Li et al. [12], for instance, studied the impact of company-specific news coverage on stock prices; in medicine and pharmacovigilance, social media listening plays an important role in gathering insights into patient needs and in monitoring adverse drug effects [4]; and communication sciences analyze media coverage to obtain information on the perception and framing of issues as well as on the rise and fall of topics within news and social media [21,26].
Computer science focuses on analyzing content by applying knowledge extraction techniques such as entity recognition [8] to automatically identify entities (e.g., persons, organizations, locations and products) within text documents, entity linking [6] to link these entities to knowledge bases such as Wikidata and DBpedia, and sentiment analysis to automatically assess the sentiment polarity (i.e., positive versus negative coverage) and emotions expressed towards these entities [24].
Most knowledge extraction methods operate on text and, therefore, require an accurate conversion of HTML content which also preserves the spatial alignment between text elements. This is particularly true for methods drawing upon algorithms which directly or indirectly leverage information on the proximity between terms, such as word embeddings [13,16], language models [18], sentiment analysis which often also considers the distance between target and sentiment terms, and automatic keyword and phrase extraction techniques.
Despite this need from within the research community, many standard HTML to text conversion techniques are not layout aware, yielding text representations that fail to preserve the text's spatial properties. Consequently, even popular resources extensively used in the literature suffer from such shortcomings. The text representations provided with the Common Crawl corpus, for instance, have been generated with a custom utility [10] which, at the time of writing, did not consider any layout information. Datasets such as CCAligned [7], the multilingual C4 corpus used for training the mT5 language model [27], and OSCAR [22] are based on subsets of the Common Crawl corpus [3].
Even worse, some tutorials suggest the use of software libraries such as Beautiful Soup [19], lxml [1] and Cheerio [14] for converting HTML. Since these libraries have been designed with a different use case in mind, they are only suited for scraping textual content. Once they encounter HTML constructs such as lists and tables, they are likely to return artifacts (e.g., concatenated words), since they do not interpret HTML semantics. The creators of the Cheerio library even explicitly warn their users that it is not well-suited for emulating Web browsers.
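The concatenation artifact mentioned above is easy to reproduce with a few lines of standard-library code (a hypothetical minimal extractor for illustration, not the actual implementation of any of the libraries named above):

```python
from html.parser import HTMLParser

class NaiveTextExtractor(HTMLParser):
    """Collects raw text nodes while ignoring all tag semantics."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

html = "<table><tr><td>Population</td><td>1,911,000</td></tr></table>"
parser = NaiveTextExtractor()
parser.feed(html)

# The cell boundary is lost: both cells are glued together.
extracted = ''.join(parser.parts)
print(extracted)  # Population1,911,000
```

Because the extractor only concatenates text nodes, the boundary between the two table cells disappears, which is exactly the kind of artifact that corrupts downstream tokenization.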
Specialized conversion tools such as HTML2Text perform considerably better but often fail for more complex Web pages. Researchers sometimes even draw upon text-based Web browsers such as Lynx to obtain more accurate representations of HTML pages. These tools are complemented by content extraction software such as jusText [2], dragnet [17], TextSweeper [11] and boilerpy3 [20] which do not consider the page layout but rather aim at extracting the relevant content only, and approaches that are optimized for certain kinds of Web pages like Harvest [25] for Web forums.
Inscriptis, in contrast, not only correctly renders more complex websites but also offers the option to preserve parts of the original HTML document's semantics (e.g., information on headings, emphasized text and tables) by complementing the extracted text with annotations obtained from the document. Figure 2 provides an example of annotations extracted from a Wikipedia page. These annotations can be useful for

• providing downstream knowledge extraction components with additional information that may be leveraged to improve their respective performance. Text summarization techniques, for instance, can put a stronger emphasis on paragraphs that contain bold and italic text, and sentiment analysis may consider this information in addition to textual clues such as uppercase text.
• assisting manual document annotation processes (e.g., for qualitative analysis or gold standard creation). Inscriptis supports multiple export formats such as XML, annotated HTML and the JSONL format used by the open source annotation tool doccano [15]. Support for further annotation formats can be easily added by implementing custom annotation post-processors.
• enabling the use of Inscriptis for tasks such as content extraction (i.e., extracting task-specific relevant content from a Web page) which rely on information on the HTML document's structure.
In conclusion, Inscriptis provides knowledge extraction components with high-quality text representations of HTML documents. Since its first public release in March 2016, Inscriptis has been downloaded over 135,000 times from the Python Package Index (PyPI), has proven its capabilities in national and European research projects, and has been integrated into commercial products such as the webLyzard Web Intelligence and Visual Analytics Platform.

Mentions
The following research projects use Inscriptis within their knowledge extraction pipelines:

• CareerCoach: "Automatic Knowledge Extraction and Recommender Systems for Personalized Re- and Upskilling Suggestions", funded by Innosuisse.
• Job Cockpit: "Web analytics, data enrichment and predictive analysis for improved recruitment and career management processes", funded by Innosuisse.

Acknowledgements
Work on Inscriptis has been conducted within the MedMon, Job Cockpit and CareerCoach projects funded by Innosuisse.