libcdict: fast dictionaries in C

A common requirement in science is to store and share large sets of simulation data in an efficient, nested, flexible and human-readable way. Such datasets contain number counts and distributions, i.e. histograms and maps, of arbitrary dimension and variable type, e.g. floating-point number, integer or character string. Modern high-level programming languages like Perl and Python have associative arrays, known as hashes and dictionaries respectively, to fulfil this storage need. Low-level languages used more commonly for fast computational simulations, such as C and Fortran, lack this functionality. We present libcdict, a C dictionary library, to solve this problem. libcdict provides C and Fortran application programming interfaces (APIs) to native dictionaries, called cdicts, and functions to load and save cdicts as JSON, so that they are easy to interpret in other software and languages like Perl, Python and R.


Statement of need
Users of high-level languages such as Perl or Python have access to associative-array data structures through hashes and dictionaries, respectively. These allow arbitrary data types to be stored in array-like structures. They are accessed through key-value pairs in which the value can itself be a further, nested associative array, allowing arbitrary nesting of data. Compiled low-level languages, like C and Fortran, are better suited to the high-speed, repeated calculations typical in science. These languages lack native associative-array functionality. While there are pure hash-table solutions, such as glib (Glib, 2022) and uthash (Hansen, 2022), these do not combine a simple API for setting and adding to nested structures, a small library footprint, fast input and output, and standardised JSON output to easily interface with other languages and tools. libcdict provides an API for such functionality which allows cdicts to be nested in cdicts, hence arbitrarily-nested dictionaries of variables in C just as in Perl or Python.
libcdict is written in C and provides an API through a set of C macros. Values in nested cdict structures are set with a single line of code. libcdict has been used for the last year in the binary_c single- and binary-star population nucleosynthesis framework (Izzard et al., 2004, 2006, 2009, 2018). Recent works (Hendriks & Izzard, 2023b; Izzard & Jermyn, 2023; Mirouh et al., 2023; Yates et al., 2023) compute the evolution of millions of single- and binary-stellar systems in only a few hours using its binary_c-python Python frontend (Hendriks & Izzard, 2023a). We provide libcdict as open-source code on GitLab subject to the GPL3. libcdict also has a comprehensive test suite run through its configuration program cdict-config.

Using libcdict
libcdict is flexible but pragmatic. Keys to cdicts can be any C scalar or pointer. Values can be scalars, pointers, arrays or other cdicts, but arrays must be of a single C type. Values can store metadata of arbitrary type. Pointer values are optionally garbage collected when a cdict is freed. A set of API macros provides simple nesting facilities so that placing a value in a nested location given a list of keys is a simple task for the C programmer. Issues such as C variable typing are automatically handled for the user.
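To make the idea of placing a value at a nested location from a list of keys concrete, the following is a minimal, self-contained sketch in plain C. It is not libcdict's API: every name in it (sketch_dict, sketch_nest_add, sketch_nest_get and so on) is hypothetical, it supports only string keys and double values, and it uses a linked list rather than a hash table to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only: these types and functions are NOT libcdict's API. */
typedef struct sketch_dict sketch_dict;

typedef struct sketch_entry {
    char *key;
    double number;              /* leaf value, meaningful when nested == NULL */
    sketch_dict *nested;        /* nested dictionary, or NULL for a leaf */
    struct sketch_entry *next;  /* next entry in this dictionary */
} sketch_entry;

struct sketch_dict {
    sketch_entry *head;         /* unordered linked list of entries */
};

sketch_dict *sketch_new(void) {
    sketch_dict *d = calloc(1, sizeof *d);
    assert(d != NULL);          /* sketch: no graceful out-of-memory handling */
    return d;
}

/* Find the entry for key, creating a zero-valued one if it is absent. */
static sketch_entry *sketch_entry_for(sketch_dict *d, const char *key) {
    for (sketch_entry *e = d->head; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) return e;
    }
    sketch_entry *e = calloc(1, sizeof *e);
    assert(e != NULL);
    e->key = malloc(strlen(key) + 1);
    assert(e->key != NULL);
    strcpy(e->key, key);
    e->next = d->head;
    d->head = e;
    return e;
}

/* Add x to the number at keys[0] -> ... -> keys[n-1],
 * creating intermediate dictionaries on the way down. */
void sketch_nest_add(sketch_dict *d, const char *const *keys, size_t n, double x) {
    assert(n > 0);
    for (size_t i = 0; i + 1 < n; i++) {
        sketch_entry *e = sketch_entry_for(d, keys[i]);
        if (e->nested == NULL) e->nested = sketch_new();
        d = e->nested;
    }
    sketch_entry_for(d, keys[n - 1])->number += x;
}

/* Read a nested number back; returns 0.0 if the path does not exist. */
double sketch_nest_get(const sketch_dict *d, const char *const *keys, size_t n) {
    for (size_t i = 0; i < n; i++) {
        const sketch_entry *e = d->head;
        while (e != NULL && strcmp(e->key, keys[i]) != 0) e = e->next;
        if (e == NULL) return 0.0;
        if (i + 1 == n) return e->number;
        if (e->nested == NULL) return 0.0;
        d = e->nested;
    }
    return 0.0;
}

/* Free a dictionary and everything nested inside it. */
void sketch_free(sketch_dict *d) {
    if (d == NULL) return;
    sketch_entry *e = d->head;
    while (e != NULL) {
        sketch_entry *next = e->next;
        sketch_free(e->nested);
        free(e->key);
        free(e);
        e = next;
    }
    free(d);
}
```

With this sketch, a caller would build a path such as {"star", "mass", "1.0"} as an array of strings and pass it to sketch_nest_add together with the amount to add. The sketch handles none of the mixed key and value types, metadata or garbage collection described above.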
Installation uses meson (Pakkanen, 2022) and ninja (Martin, 2022). libcdict has been tested with the GCC (10.3.0) and Clang (12.0.0) compilers.

libcdict in stellar-population statistics calculations
libcdict was developed to solve the problem of storing statistics in stellar-population calculations in binary_c. When evolving a population of millions, sometimes billions, of stars, each for thousands of time steps, enormous amounts of data are computed. It is impractical to output these data every time step as these are typically ∼ 10^6 × 10^4 = 10^10 lines, each of which can easily be ∼ 1 KB long, i.e. of order 10 TB in total. The data from each star could instead be sent to a Perl or Python front-end which merges them into a dictionary of population statistics. However, this communication between programming languages involves significant overhead, comparable to the runtime of the stellar code itself, and thus greatly increases runtime and cost.
To overcome this problem, binary_c internally generates an associative-array cdict in native C. This cdict, and the stellar statistics it contains, is filled inside the binary_c simulation as each star is simulated. Generation of the stellar-population data in the cdict is efficient because it happens entirely in C, and communication with the frontend (Python) code is kept to a minimum. The cdict's dataset is output only once, as human-readable JSON easily understood by Perl or Python, at the end of the simulation. Large simulations are often split across clusters of machines using binary_c-python. The data from each run are stored as JSON chunks then merged in Python when the final run completes. The overhead involved in this joining is small compared to the effort of simulating the stars: the goal of libcdict has thus been achieved. We provide an interactive example made with binary_c and binary_c-python using libcdict in its examples directory (Izzard, 2022). The libcdict JSON output of a Hertzsprung-Russell diagram, the most important diagnostic plot in stellar astrophysics, is plotted using Bokeh (Bokeh Development Team, 2014; Bokeh GitHub, 2022) to provide immediate access to nested data sets.
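The accumulate-in-memory, output-once pattern described above can be sketched in a few lines of plain C. This is an illustration under stated assumptions, not binary_c or libcdict code: the histogram type and the hist_add, hist_merge and hist_to_json functions are hypothetical, the structure is a flat fixed-size histogram rather than a nested cdict, and the merge is done in C here only to keep the example in one language, whereas the workflow described above merges JSON chunks in Python.

```c
#include <assert.h>
#include <stdio.h>

#define NBINS 4

/* Hypothetical illustration only: not part of binary_c or libcdict. */
typedef struct {
    double lo;            /* lower edge of the first bin */
    double width;         /* bin width, must be positive */
    double count[NBINS];  /* accumulated weight per bin */
} histogram;

/* Add weight w to the bin containing x; out-of-range or NaN values are ignored. */
void hist_add(histogram *h, double x, double w) {
    assert(h->width > 0.0);
    double pos = (x - h->lo) / h->width;
    if (!(pos >= 0.0 && pos < NBINS)) return;
    h->count[(int)pos] += w;
}

/* Merge src into dst bin by bin, as when joining chunks from separate runs.
 * Assumes both histograms share the same lo and width. */
void hist_merge(histogram *dst, const histogram *src) {
    for (int i = 0; i < NBINS; i++) dst->count[i] += src->count[i];
}

/* Write the histogram as a JSON object keyed by each bin's lower edge.
 * Returns the number of characters written, or -1 if buf is too small. */
int hist_to_json(const histogram *h, char *buf, size_t size) {
    size_t used = 0;
    for (int i = 0; i < NBINS; i++) {
        int n = snprintf(buf + used, size - used, "%s\"%g\": %g",
                         i == 0 ? "{" : ", ",
                         h->lo + i * h->width, h->count[i]);
        if (n < 0 || (size_t)n >= size - used) return -1;
        used += (size_t)n;
    }
    int n = snprintf(buf + used, size - used, "}");
    if (n < 0 || (size_t)n >= size - used) return -1;
    return (int)(used + (size_t)n);
}
```

In this sketch each simulated star would call hist_add once per quantity of interest, and hist_to_json would be called a single time at the end of the run, so the cost of formatting and writing text is paid once rather than at every time step.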