hdt-rs: A Rust library for the Header Dictionary Triples binary RDF compression format

We present the Rust library hdt-rs (named “hdt” in the context of Rust libraries, such as on crates.io) for the Header Dictionary Triples (HDT) binary RDF compression format. It enables high-performance Rust applications that load and query HDT datasets using triple patterns. Existing Rust applications that use the Sophia library (Champin, 2020) can easily and greatly reduce their RAM usage by using the provided Sophia HDT adapter.

assigns a unique numerical identifier (ID) to each of them. This allows the triples component to store the adjacency matrix of the graph using those IDs in compressed form. All patterns with constant subject (SPO, SP?, SO?, and S??) as well as the one with all variables (???) are answered using the Bitmap Triples structure (see Figure 1), while the other patterns use the HDT Focused on Querying (HDT-FoQ) extension, see Figure 2. As HDT is a complex format, we recommend referring to Martínez-Prieto et al. (2012) and Fernández et al. (2013) for comprehensive documentation.
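The ID assignment and the choice of index structure described above can be illustrated with a small sketch. It is not the hdt-rs implementation: `ToyDictionary`, `Index`, and `index_for` are hypothetical names, the dictionary is a plain hash map rather than HDT's compressed, sectioned dictionary, and a single ID space is assumed for all terms.

```rust
use std::collections::HashMap;

/// Toy dictionary (hypothetical): maps each distinct RDF term to an ID starting at 1.
/// A real HDT dictionary is sorted, split into sections, and compressed.
#[derive(Default)]
struct ToyDictionary {
    ids: HashMap<String, usize>,
    terms: Vec<String>,
}

impl ToyDictionary {
    /// Returns the ID of `term`, assigning the next free ID on first sight.
    fn id_of(&mut self, term: &str) -> usize {
        if let Some(&id) = self.ids.get(term) {
            return id;
        }
        self.terms.push(term.to_owned());
        let id = self.terms.len();
        self.ids.insert(term.to_owned(), id);
        id
    }

    /// Translates an ID back into its term, if the ID has been assigned.
    fn term_of(&self, id: usize) -> Option<&str> {
        id.checked_sub(1)
            .and_then(|i| self.terms.get(i))
            .map(String::as_str)
    }
}

/// The two structures named in the text.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Index {
    BitmapTriples,
    HdtFoq,
}

/// Picks the structure for a triple pattern; `None` stands for a variable.
/// Constant subject or all variables: Bitmap Triples. Everything else: HDT-FoQ.
fn index_for(s: Option<usize>, p: Option<usize>, o: Option<usize>) -> Index {
    match (s, p, o) {
        (Some(_), _, _) | (None, None, None) => Index::BitmapTriples,
        _ => Index::HdtFoq,
    }
}

fn main() {
    let mut dict = ToyDictionary::default();
    let s = dict.id_of("http://example.org/alice");
    let p = dict.id_of("http://xmlns.com/foaf/0.1/knows");
    let o = dict.id_of("http://example.org/bob");
    println!("triple as IDs: ({s}, {p}, {o})");
    println!("ID {o} is {:?}", dict.term_of(o));
    println!("S?? uses {:?}", index_for(Some(s), None, None));
    println!("?PO uses {:?}", index_for(None, Some(p), Some(o)));
}
```

Working on IDs instead of strings is what lets the triples component stay compact: terms are translated once at the boundary, and all pattern matching happens on integers.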

Statement of need
Semantic Web technologies have been adopted by major tech companies in recent years, but widespread use is still inhibited by a lack of freely available, performant, accessible, robust, and adaptable tooling (Hitzler, 2021). SPARQL endpoints provide a standard publication channel and API to any RDF graph, but they are not suitable for all use cases. On small graphs, there is a large relative overhead in both memory and CPU resources. On large graphs, on the other hand, query complexity and shared access may cause an overload of the server, causing delayed or missed responses. The long-term availability of SPARQL endpoints is often compromised (Buil-Aranda et al., 2013), which impacts all applications that depend on them.
To insulate against such problems, Semantic Web applications can integrate and query an RDF graph using libraries such as Apache Jena (Carroll et al., 2004) for Java, RDFlib (Swartz et al., 2023) for Python, librdf (Beckett et al., 2015) for C, or Sophia (Champin, 2020) for Rust. However, these libraries do not scale to large RDF graphs due to their excessive memory usage, see Figure 3. To complement hdt-cpp (Arias et al., 2023) and hdt-java (Torres et al., 2022), we implement HDT in Rust, a popular, modern, statically typed high-level programming language that allows writing performant software while ensuring memory safety, which meets the challenges of Semantic Web adoption. hdt-rs is used by the RDF browser RickView (Höffner, 2023) via the included Sophia adapter to publish large graphs, for example LinkedSpending (Höffner et al., 2016) at https://linkedspending.aksw.org, which previously suffered from frequent downtime when based on a SPARQL endpoint.

Table 1: Rounded averages over four runs on the complete person data dataset containing 10310105 triples (rightmost points in Figure 3) serialised as a 90 MB HDT and 1.2 GB RDF Turtle file. Sorted by memory usage of the graph. For better comparison, results for hdt_java are given both with and without calling DelayedString::toString on the results. The measured values are subject to considerable fluctuations, see the vertical bars in Figure 3.

Table 1 demonstrates the advantage of HDT libraries in memory usage, with hdt_cpp using only 112 MB compared to 834 MB for the most memory-efficient non-HDT RDF library tested, sophia_lg (LightGraph). When comparing only Rust libraries, sophia_lg still uses over three times as much memory as hdt_rs. The memory consumption is calculated by comparing the resident set size before and after graph loading and index generation, with the caveat that the memory usage may be higher during graph loading. Converting other formats to HDT in the first place is also a time- and memory-intensive process.
The uncompressed and fully indexed Sophia FastGraph (sophia) strongly outperforms the HDT libraries in ?PO query time, with 20 ms compared to 214 ms or 321 ms for hdt_java, depending on whether the results are converted to strings. While being the fastest-querying HDT library in this test, hdt_java has a large memory usage for an HDT library, placing it closer to the much faster sophia_lg. The large overhead on small graph sizes for hdt_java in Figure 3 suggests that with larger graph sizes, these considerations might yield different results. In fact, HDT allows loading much larger datasets, but at that point, several of the tested libraries could not have been included, such as rdflib, which already uses over 14 GB of memory to load the ~10 million triples.
hdt_rs achieves the lowest graph-loading time with 912 ms compared to more than 11 s for the fastest-loading non-HDT library sophia_lg. hdt_cpp and hdt_java can speed up loading by reusing previously saved indexes, but these were deleted between runs to achieve consistent measurements.
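The before-and-after comparison of the resident set size mentioned above can be sketched as follows. The text does not say how the benchmark suite reads the value, so this is only one possible approach under stated assumptions: a Linux system that exposes `/proc/self/status`, and hypothetical helper names `parse_vm_rss_kb` and `current_rss_kb`. The vector allocation is a stand-in for loading a graph.

```rust
use std::fs;

/// Extracts the resident set size in kB from the text of a Linux
/// /proc/<pid>/status file (the line looks like "VmRSS:     1234 kB").
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|kb| kb.parse::<u64>().ok())
}

/// Current resident set size of this process in kB.
/// Returns None where /proc is unavailable (assumption: Linux only).
fn current_rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kb(&status)
}

fn main() {
    let before = current_rss_kb();

    // Stand-in for "load the graph and generate its indexes".
    let graph: Vec<u64> = (0..5_000_000).collect();

    let after = current_rss_kb();
    match (before, after) {
        (Some(b), Some(a)) => println!("RSS grew by about {} kB", a.saturating_sub(b)),
        _ => println!("resident set size not available on this platform"),
    }

    // Use the data after measuring so that it is still alive at that point.
    println!("elements held: {}", graph.len());
}
```

As the text notes, a difference taken this way reflects the state after loading only; a peak during loading that has already been released would not show up.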

Examples
Further examples are available in the API documentation and in the code repository.

Figure 1: The Bitmap Triples structure represents the adjacency matrix of the RDF graph as trees. Image source and further information in Martínez-Prieto et al. (2012).
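To make the tree view of Bitmap Triples more concrete, the following sketch encodes ID triples into two ID sequences and two bitmaps and then answers an S?? pattern. It is a simplified illustration, not the hdt-rs data structure: all names are hypothetical, subject IDs are assumed to be dense (1..=n), a set bit is assumed to mark the last entry of each list, and a linear scan replaces the rank/select operations that a practical implementation would use.

```rust
use std::ops::Range;

/// Simplified Bitmap Triples (hypothetical layout for illustration).
#[derive(Default, Debug)]
struct BitmapTriples {
    sp: Vec<usize>, // predicate IDs, grouped by subject
    bp: Vec<bool>,  // true marks the last predicate of a subject
    so: Vec<usize>, // object IDs, grouped by (subject, predicate) pair
    bo: Vec<bool>,  // true marks the last object of a pair
}

/// Index range of the k-th (0-based) list in a bitmap whose set bits end lists.
fn group(bits: &[bool], k: usize) -> Option<Range<usize>> {
    let (mut start, mut seen) = (0, 0);
    for (i, &bit) in bits.iter().enumerate() {
        if bit {
            if seen == k {
                return Some(start..i + 1);
            }
            seen += 1;
            start = i + 1;
        }
    }
    None
}

impl BitmapTriples {
    /// Builds the structure from (subject, predicate, object) ID triples.
    fn build(mut triples: Vec<(usize, usize, usize)>) -> Self {
        triples.sort_unstable(); // SPO order
        triples.dedup();
        let mut bt = Self::default();
        let mut prev: Option<(usize, usize)> = None;
        for &(s, p, o) in &triples {
            if prev != Some((s, p)) {
                // A new (s, p) pair closes the previous object list ...
                if let Some(last) = bt.bo.last_mut() {
                    *last = true;
                }
                // ... and a new subject also closes the previous predicate list.
                if prev.map_or(false, |(ps, _)| ps != s) {
                    if let Some(last) = bt.bp.last_mut() {
                        *last = true;
                    }
                }
                bt.sp.push(p);
                bt.bp.push(false);
                prev = Some((s, p));
            }
            bt.so.push(o);
            bt.bo.push(false);
        }
        // Close the final lists.
        if let Some(last) = bt.bp.last_mut() {
            *last = true;
        }
        if let Some(last) = bt.bo.last_mut() {
            *last = true;
        }
        bt
    }

    /// Answers the S?? pattern by walking one subject's subtree.
    fn with_subject(&self, s: usize) -> Vec<(usize, usize, usize)> {
        let mut out = Vec::new();
        let Some(preds) = s.checked_sub(1).and_then(|k| group(&self.bp, k)) else {
            return out;
        };
        for j in preds {
            if let Some(objs) = group(&self.bo, j) {
                out.extend(objs.map(|i| (s, self.sp[j], self.so[i])));
            }
        }
        out
    }
}

fn main() {
    let bt = BitmapTriples::build(vec![(1, 1, 2), (1, 1, 3), (1, 2, 4), (2, 1, 1)]);
    println!("{bt:?}");
    println!("S?? for subject 1: {:?}", bt.with_subject(1));
}
```

Because subjects are implicit in the position of their predicate list, a lookup with a constant subject only touches that subject's subtree, which is consistent with the statement that such patterns are answered directly from this structure.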

Figure 3: Dataset load time, memory usage (resident set size), and ?PO triple pattern query time of different RDF libraries on an Intel i9-12900k CPU based on the benchmark suite of Champin (2020). librdf was not benchmarked on 10^6 triples and beyond due to graph loading times exceeding several hours. hdt-java produces DelayedString instances that are converted to strings to account for the time that would otherwise be spent later. The index files that hdt-java and hdt-cpp produce are deleted before each run. Versions: Apache Jena 4.6.1, n3.js 1.6.3, librdf 1.0.17, RDFlib 6.2.0, sophia 0.8.0-alpha, hdt-rs 0.0.13-alpha, hdt-java 3.0.9, hdt-cpp master fbcb31a, OpenJDK 19, Node.js 16.18.0, clang 14.0.6, Python 3.10.8, rustc 1.69.0-nightly (target-cpu=native), GCC 12.2.1.

Table 1 column headers: Library, Memory in MB, Load Time in ms, Query Time in ms.