matador: a Python library for analysing, curating and performing high-throughput density-functional theory calculations

Summary
The properties of materials depend heavily on their atomistic structure; knowledge of the possible stable atomic configurations that define a material is required to understand the performance of many technologically and ecologically relevant devices, such as those used for energy storage (A. F. Harper et al., 2020; Marbella et al., 2018). First-principles crystal structure prediction (CSP) is the art of finding these stable configurations using only quantum mechanics (A. F. Harper, Evans, et al., 2020). Density-functional theory (DFT) is a ubiquitous theoretical framework for finding approximate solutions to quantum mechanics; calculations using a modern DFT package are sufficiently robust and accurate that insight into real materials can be readily obtained. The computationally intensive work is performed by well-established, low-level software packages, such as CASTEP (Clark et al., 2005) or Quantum Espresso (Giannozzi et al., 2009), which are able to make use of modern high-performance computers. To use these codes easily, reliably and reproducibly, many high-level libraries have been developed to create, curate and manipulate the calculations from these low-level workhorses; matador is one such framework.

Statement of need
The purpose of matador is fourfold:
• to promote the use of local databases and high-throughput workflows to increase the reproducibility of the computational results,
• to perform reliable analysis of the stability, structure and properties of materials derived from calculations,
• to provide tools to create customisable, publication-quality plots of phase diagrams, spectral properties and electrochemistry,
• to make the above functionality available to those with limited programming experience.

matador
matador is a Python 3.6+ library and set of command-line tools for performing and analysing high-throughput DFT calculations using the CASTEP (Clark et al., 2005) and Quantum Espresso (Giannozzi et al., 2009) packages. It is well tested and fully documented at ReadTheDocs, and comes with several tutorials and examples. The package is available on PyPI under the name matador-db. As with many projects, matador is built on top of the scientific Python ecosystem of NumPy (Harris et al., 2020), SciPy and matplotlib (Hunter, 2007).
matador has been developed with high-throughput CSP in mind and has found use in the application of CSP to energy storage materials (Marbella et al., 2018); in this use case, a single compositional phase diagram can consist of tens of thousands of structural relaxation calculations. This package is aimed at users of CASTEP or Quantum Espresso who are comfortable with the command line, yet may lack the Python knowledge required to start from scratch with more sophisticated packages. There are many mature packages that provide overlapping functionality with matador, the most widespread of which are the Atomic Simulation Environment (ASE) (Larsen et al., 2017) and pymatgen (Ong et al., 2013). A translation layer to and from the structure representation of both of these packages is provided, such that analysis can be reused and combined.

Overview of functionality
There are two ways of working with matador, either from the command-line interface (CLI) or through the Python library directly, with some features that are unique to each. The functionality of matador can be broadly split into three categories:

Creation and curation of databases of the results of first-principles calculations.
matador allows for the creation of MongoDB databases of CASTEP (6.0+) geometry optimisations from the command line, using matador import. Only calculations that are deemed "successful" and that have fully specified inputs are stored, with errors displayed for the rest. The resulting database can be queried with matador query, either with Python or through the powerful CLI. The results can be filtered for structural "uniqueness", written to one of several supported file types, and exported for use in other frameworks, such as ASE or pymatgen. Prototyping of structures directly from the database is achieved using matador swaps, which uses the same interface as matador query to return structure files with "swapped" elements (Marbella et al., 2018).
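The idea of prototyping by element swapping can be illustrated with a minimal, self-contained sketch. It does not use matador's own API: the record layout (`source`, `atom_types`) and the helper names `contains`, `swap_elements` and `formula` are assumptions made purely for illustration, and a real database entry would also carry the lattice, atomic positions and calculation parameters.

```python
from collections import Counter

# Hypothetical, simplified records; field names are illustrative only.
structures = [
    {"source": "LiP-a", "atom_types": ["Li", "Li", "Li", "P"]},
    {"source": "LiAs-b", "atom_types": ["Li", "As"]},
    {"source": "NaP-c", "atom_types": ["Na", "P"]},
]


def contains(doc, element):
    """Query-like filter: does this structure contain the given element?"""
    return element in doc["atom_types"]


def swap_elements(doc, swaps):
    """Return a copy of `doc` with species replaced according to `swaps`.

    `swaps` maps old element symbols to new ones, e.g. {"Li": "Na"}.
    The original record is left untouched.
    """
    new_doc = dict(doc)
    new_doc["atom_types"] = [swaps.get(el, el) for el in doc["atom_types"]]
    new_doc["source"] = doc["source"] + "-swapped"
    return new_doc


def formula(doc):
    """Unreduced formula string with elements in alphabetical order."""
    counts = Counter(doc["atom_types"])
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(counts.items()))


# Select every Li-containing structure and propose a Na analogue of it.
prototypes = [swap_elements(d, {"Li": "Na"}) for d in structures if contains(d, "Li")]
```

In this sketch the swapped records are only starting guesses: each one would typically need to be relaxed again before its energy could be compared with the rest of the database.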

High-throughput calculations and automated workflows.
The run3 executable bundled with matador allows high-throughput calculations to be performed with little setup and no programming knowledge. Specialised support for CASTEP and the post-processing tool OptaDOS (Morris, Nicholls, Pickard, & Yates, 2014; Nicholls, Morris, Pickard, & Yates, 2012) is provided to perform high-throughput geometry optimisations, orbital-projected band structures and densities of states, phonon calculations and elastic properties; however, run3 can also be used to run generic MPI programs concurrently on a set of structures. Sensible defaults for these workflows are provided by leveraging the open-source SeeK-path (Hinuma, Pizzi, Kumagai, Oba, & Tanaka, 2017) and spglib (Togo & Tanaka, 2018) libraries. The bundled dispersion script and associated library functionality allow for the creation of publication-quality spectral and vibrational property plots, in a similar fashion to the sumo package (Ganose, Jackson, & Scanlon, 2018). The matador.compute module behind run3 also powers the ilustrado genetic algorithm code.
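The general pattern of running one job per structure concurrently, while keeping failures from halting the batch, can be sketched with the Python standard library alone. This is not how run3 is implemented: `run_all`, `fake_relaxation` and the record fields are hypothetical names, and the stand-in runner simply returns a number where a real one would launch an external executable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(structures, runner, max_workers=4):
    """Apply `runner` to every structure concurrently.

    Returns (results, failures): two dicts keyed by structure name. A failing
    job is recorded rather than raised, so one bad structure does not stop
    the rest of the batch.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(runner, s): name for name, s in structures.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                failures[name] = str(exc)
    return results, failures


def fake_relaxation(structure):
    """Stand-in for launching an external code, e.g. via subprocess.run."""
    if structure["num_atoms"] <= 0:
        raise ValueError("structure has no atoms")
    return {"energy_per_atom": -1.0 * structure["num_atoms"]}


# Hypothetical batch: two valid structures and one that will fail.
batch = {
    "a": {"num_atoms": 2},
    "b": {"num_atoms": 4},
    "broken": {"num_atoms": 0},
}
results, failures = run_all(batch, fake_relaxation)
```

Threads are used here because a real runner would spend its time waiting on an external process; collecting failures in a separate dictionary makes it straightforward to report or resubmit them afterwards.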

Stability and structural analysis (with an emphasis on battery materials).
The construction of reliable compositional phase diagrams requires several independent calculations to be performed on different atomic configurations with a compatible set of external parameters. These can be generated from a database query using matador hull, which allows the user to filter between different sets of calculations, and, where relevant, matador voltage can provide the electrochemical properties of that same phase diagram. Structural fingerprints implemented include pair distribution functions, powder X-ray diffraction patterns, and periodic crystal bond graphs. As more calculations are performed, changes to phase diagrams stored in the local database can be tracked with matador hulldiff. Phase diagrams can also be constructed from multiple energy values per structure, for example to show the effects of finite temperature, or in the specific case of ensemble-based exchange-correlation functionals like the Bayesian Error Estimate Functional (BEEF) (Mortensen et al., 2005). An example of a ternary phase diagram is shown in Figure 1.
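To make the link between a phase diagram and its electrochemistry concrete, the following sketch builds a binary convex hull and derives average voltages from it. It is a textbook-style illustration rather than matador's implementation: the function names and the energies are invented, each point is (atomic fraction of the active element A, formation energy per atom in eV), and each A atom is assumed to transfer one electron.

```python
def cross(o, a, b):
    """2D cross product of vectors OA and OB; > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Lower convex hull of (x, E) points via Andrew's monotone chain."""
    hull = []
    for p in sorted(points):
        # Discard the previous vertex while it would make the hull non-convex.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def hull_distance(point, hull):
    """Energy of `point` above the hull, by linear interpolation."""
    x, energy = point
    for (x1, e1), (x2, e2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            on_hull = e1 + (e2 - e1) * (x - x1) / (x2 - x1)
            return energy - on_hull
    raise ValueError("composition lies outside the hull")


def voltage_steps(hull):
    """Average voltage between adjacent hull phases, from A-poor to A-rich.

    Assumes one electron per A atom, so eV per A atom equals volts.
    """
    phases = [(c, e) for c, e in hull if c < 1.0]  # pure A is the reference
    steps = []
    for (c1, e1), (c2, e2) in zip(phases, phases[1:]):
        x1, x2 = c1 / (1 - c1), c2 / (1 - c2)  # A atoms per host atom
        f1, f2 = e1 / (1 - c1), e2 / (1 - c2)  # formation energy per host atom
        steps.append(-(f2 - f1) / (x2 - x1))
    return steps


# Invented data: the pure elements define zero formation energy.
points = [(0.0, 0.0), (0.5, -0.4), (0.6, -0.2), (0.75, -0.3), (1.0, 0.0)]
hull = lower_hull(points)
above = hull_distance((0.6, -0.2), hull)
steps = voltage_steps(hull)
```

With these invented numbers the phase at x = 0.6 lies 0.16 eV/atom above the hull and is therefore excluded, and the two remaining intermediate phases give voltage plateaus of 0.8 V and 0.2 V.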