optimade-python-tools: a Python library for serving and consuming materials data via OPTIMADE APIs

1 Institut de la Matière Condensée et des Nanosciences, Université catholique de Louvain, Chemin des Étoiles 8, Louvain-la-Neuve 1348, Belgium
2 Theory of Condensed Matter Group, Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, United Kingdom
3 Theory and Simulation of Materials (THEOS), Faculté des Sciences et Techniques de l’Ingénieur, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
4 Lawrence Berkeley National Laboratory, Berkeley, CA, USA
5 Fritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, 14195, Berlin, Germany
6 Humboldt-Universität zu Berlin, Institut für Physik and IRIS Adlershof, 12489 Berlin, Germany
7 Polyneme LLC, New York, NY, USA
8 Department of Physics, King’s College London, Strand, London WC2R 2LS, United Kingdom
9 Department of Physics and Namur Institute of Structured Materials, University of Namur, Rue de Bruxelles 51, 5000 Namur, Belgium
DOI: 10.21105/joss.03458


Summary
In recent decades, improvements in algorithms, hardware, and theory have enabled crystalline materials to be studied computationally at the atomistic level with great accuracy and speed. To enable dissemination, reproducibility, and reuse, many digital crystal structure databases have been created and curated, ready for comparison with existing infrastructure that stores structural characterizations (e.g., diffraction) of real crystals. Each database typically has a bespoke, stateless, web-based Application Programming Interface (API), to which users submit queries via specially crafted URLs. Such esoteric and specialized APIs impose maintenance and usability costs on both data providers and data consumers, who may not be software specialists.
The OPTIMADE API specification (Andersen et al., 2021), released in July 2020, aimed to reduce these costs by designing a common API for use across a consortium of collaborating materials databases and beyond. Whilst based on the robust JSON:API standard (Katz et al., 2015), the OPTIMADE API specification presents several domain-specific features and requirements that can be tricky to implement for non-specialist teams. The repository presented here, optimade-python-tools, provides a modular reference server implementation and a set of associated tools to accelerate the development process for data providers, toolmakers and end-users.

Statement of need
To accommodate existing materials database APIs, the OPTIMADE API specification allows for flexibility in the specific data served, but enforces a simple yet domain-specific filter language on well-defined resources. However, this flexibility could be daunting to database providers and is likely to raise the barrier to hosting an OPTIMADE API.
optimade-python-tools aims to catalyse the creation of APIs from existing and new data sources by providing a configurable and modular reference server implementation for hosting materials data in an OPTIMADE-compliant way. The repository hosts the optimade Python package, which leverages the modern Python libraries pydantic (Colvin & others, 2021) and FastAPI (Ramírez & others, 2021) to specify the data models and API routes defined in the OPTIMADE API specification, additionally providing a schema following the OpenAPI format (Miller et al., 2021). As this package was developed concomitantly with the OPTIMADE specification itself, the present authors are not aware of any other generic packages with similar functionality. Two storage back-ends are supported out of the box, with full filter support for databases that employ the popular MongoDB or Elasticsearch frameworks.

Functionality
The modular functionality of optimade can be broken down by the different stages of a user query to the reference server. Consider the following query URL to an OPTIMADE API, which should filter for any crystal structures in the database with a composition that consists of any three elements in a 1:1:1 ratio:
https://example.org/v1/structures?filter=chemical_formula_anonymous="ABC"
1. After routing the query to the appropriate /structures/ endpoint adhering to version v1 of the specification, the filter string chemical_formula_anonymous="ABC" is tokenized and parsed into an abstract tree by a FilterParser object using the Lark parsing library (Shinan & others, 2021) against the formal grammar defined by the specification.
2. The abstract tree is then transformed by a FilterTransformer object into a database query specific to the configured back-end for the server. This transformation can include aliasing and custom transformations such that the underlying database format can be accommodated.
3. The results from the database query are then de-serialized by EntryResourceMapper objects into the OPTIMADE-defined data models and finally re-serialized into JSON before being served to the user over HTTP.
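The three stages above can be illustrated with a minimal, standard-library-only sketch. It is not the implementation used by the optimade package: the toy parser accepts only a single property = "string" comparison rather than the full filter grammar, and the function names, the ALIASES table and the back-end field name formula_anon are hypothetical, chosen here purely for illustration.

```python
import json
import re
from urllib.parse import urlencode

BASE_URL = "https://example.org/v1/structures"  # example URL from the text

# Hypothetical alias table: OPTIMADE property name -> back-end field name.
ALIASES = {"chemical_formula_anonymous": "formula_anon"}

# Toy grammar: a single `property = "string"` comparison, nothing else.
_COMPARISON = re.compile(r'^\s*([a-z_][a-z0-9_]*)\s*=\s*"([^"]*)"\s*$')


def build_query_url(filter_string):
    """Percent-encode the filter and attach it as the `filter` URL parameter."""
    return BASE_URL + "?" + urlencode({"filter": filter_string})


def parse_filter(filter_string):
    """Stage 1: turn the filter string into a tiny (operator, property, value) tree."""
    match = _COMPARISON.match(filter_string)
    if match is None:
        raise ValueError(f"unsupported filter in this sketch: {filter_string!r}")
    return ("=", match.group(1), match.group(2))


def to_backend_query(tree, aliases=ALIASES):
    """Stage 2: transform the tree into a MongoDB-style query, applying aliases."""
    _operator, prop, value = tree
    return {aliases.get(prop, prop): {"$eq": value}}


def to_response(documents, aliases=ALIASES):
    """Stage 3: map back-end documents to JSON:API-style entries and serialize."""
    reverse = {field: prop for prop, field in aliases.items()}
    data = [
        {
            "type": "structures",
            "id": str(doc["_id"]),
            "attributes": {reverse.get(k, k): v for k, v in doc.items() if k != "_id"},
        }
        for doc in documents
    ]
    return json.dumps({"data": data})


if __name__ == "__main__":
    filter_string = 'chemical_formula_anonymous="ABC"'
    print(build_query_url(filter_string))
    print(to_backend_query(parse_filter(filter_string)))
    # A made-up document standing in for whatever the back-end would return.
    print(to_response([{"_id": 1, "formula_anon": "ABC"}]))
```

Note that the = and " characters in the filter are percent-encoded (as %3D and %22) when the URL is built programmatically. Keeping the three stages as separate functions mirrors the modular split described above, so that a different back-end would only need a different second stage.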
Beyond this query functionality, the package also provides:
• A fuzzy implementation validator that performs HTTP queries against remote or local OPTIMADE APIs, with test queries and expected responses generated dynamically based on the data served at the introspective /info/ endpoint of the API implementation.
• Entry "adapters" that can convert between OPTIMADE-compliant entries and the data models of popular Python libraries used widely in the materials science community: pymatgen, ASE (Larsen et al., 2017), AiiDA, and JARVIS (Choudhary et al., 2020).
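To make the idea of an entry adapter concrete, the following standard-library-only sketch converts a structure entry into an XYZ-style text block. It does not reproduce the optimade.adapters API: the function structure_to_xyz and the example entry are hypothetical, and the sketch assumes that the species names in species_at_sites are plain element symbols. The attribute names lattice_vectors, cartesian_site_positions and species_at_sites are those defined for structures in the OPTIMADE specification.

```python
# Made-up entry used only for illustration; not taken from any real database.
EXAMPLE_ENTRY = {
    "id": "example-1",
    "type": "structures",
    "attributes": {
        "lattice_vectors": [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]],
        "cartesian_site_positions": [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]],
        "species_at_sites": ["Na", "Cl"],
    },
}


def structure_to_xyz(entry):
    """Render a structure entry as XYZ-style text with the lattice in the comment line."""
    attributes = entry["attributes"]
    species = attributes["species_at_sites"]
    positions = attributes["cartesian_site_positions"]
    if len(species) != len(positions):
        raise ValueError("species_at_sites and cartesian_site_positions differ in length")

    # Flatten the three lattice vectors into nine space-separated numbers.
    lattice = " ".join(
        f"{component:.6f}"
        for vector in attributes["lattice_vectors"]
        for component in vector
    )

    # Line 1: number of sites; line 2: comment line; then one line per site.
    lines = [str(len(species)), f'Lattice="{lattice}" id={entry["id"]}']
    for symbol, (x, y, z) in zip(species, positions):
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(structure_to_xyz(EXAMPLE_ENTRY))
```

An adapter in this sense is a one-way mapping from the common OPTIMADE representation to the data model of a target tool, so a client written against one database can reuse the same conversion for any other compliant provider.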

Use cases
The package is currently used in production by three major data providers for materials science data:
• The Materials Project uses optimade-python-tools alongside their existing API (Ong et al., 2015) and MongoDB database, providing access to highly-curated density-functional theory calculations across all known inorganic materials. optimade-python-tools handles filter parsing, database query generation and response validation by running the reference server implementation with minimal configuration.
• NOMAD (Ghiringhelli et al., 2017) uses optimade-python-tools as a library to extend its existing web app with OPTIMADE API routes. It uses the Elasticsearch implementation to filter millions of structures from published first-principles calculations provided by users and other projects. NOMAD also uses the filtering module in its own API to expose the OPTIMADE filter language in the user-centric web interface search bar. NOMAD uses a released version of optimade-python-tools, and all necessary customization can be realized via configuration and sub-classing.
• Materials Cloud uses optimade-python-tools as a library to provide an OPTIMADE API entry to archived computational materials studies, created with the AiiDA Python framework and published through their archive. In this case, each individual study and archive entry has its own database and separate API entry. The Python classes within the optimade package have been extended to make use of AiiDA and its underlying PostgreSQL storage engine.
The optimade.adapters module from the optimade-python-tools library is also used in a graphical web client hosted on Materials Cloud (Andersen, 2021). The client allows users to query OPTIMADE API implementations using user-friendly widgets as well as raw filter strings, and uses the registry of known OPTIMADE providers to allow easy switching between databases. The crystal structures returned can be inspected visually and either downloaded in formats provided by conversion functions in the optimade.adapters module, or used seamlessly within other Materials Cloud web tools, where the structure is automatically validated and transferred in the background, partly using the optimade.adapters module.