SMACT: Semiconducting Materials by Analogy and Chemical Theory

License: Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC-BY).

The paradigm of data-driven science is revolutionising the materials discovery process. There are now many databases containing experimental and calculated materials properties, and extensive codes are available for applying data mining, machine learning, and other statistical approaches (a well-maintained list is available here). While we use these tools to push forward in the quest to learn as much as we can from existing materials, it is becoming clear that the search space for new materials remains relatively uncharted.
The discovery of new chemical compounds (combinations of elements arranged in a particular way in space) underpins materials discovery. The smact Python library is designed to facilitate a top-down approach in which sets of element combinations are generated and then screened using chemical filters. It is possible to screen for candidates that make "chemical sense" according to the well-established principles of electron valence and charge neutrality. The methodology is inspired by the seminal work of Goodman and Pamplin, who carried out similar procedures by hand, predicting the existence of new semiconductors by analogy with existing compounds (Goodman, 1958; Pamplin, 1964).
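The generate-then-screen idea can be illustrated with a minimal sketch of a charge-neutrality filter. This is not smact's own implementation or API: the function names, the `max_stoich` parameter, and the small hand-entered table of oxidation states are assumptions made for illustration only.

```python
from functools import reduce
from itertools import combinations, product
from math import gcd

# Illustrative oxidation states entered by hand (a tiny subset, not smact's tables).
OXIDATION_STATES = {
    "Na": [1],
    "Ti": [2, 3, 4],
    "Cl": [-1],
    "O": [-2],
}


def neutral_ratios(charges, max_stoich=4):
    """Return irreducible stoichiometries (1..max_stoich per element) with zero net charge."""
    ranges = [range(1, max_stoich + 1)] * len(charges)
    return [
        ratio
        for ratio in product(*ranges)
        if sum(q * n for q, n in zip(charges, ratio)) == 0
        and reduce(gcd, ratio) == 1  # drop multiples such as Na2Cl2
    ]


def screen(elements, n_components=2, max_stoich=4):
    """Yield (elements, charges, ratio) for every charge-neutral candidate."""
    for combo in combinations(elements, n_components):
        state_lists = [OXIDATION_STATES[el] for el in combo]
        for charges in product(*state_lists):
            for ratio in neutral_ratios(charges, max_stoich):
                yield combo, charges, ratio


candidates = list(screen(["Na", "Ti", "Cl", "O"]))
for combo, charges, ratio in candidates:
    print(combo, charges, ratio)
```

Combinations of two cations or two anions are rejected automatically because no positive stoichiometry can balance charges of the same sign.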
Once a set of compositions is generated, further functions built into smact can be used to filter for candidates with target properties using data-driven models. These functions can predict key electronic structure properties such as the optical band gap using the solid-state energy scale (Pelatt, Ravichandran, Wager, & Keszler, 2011), evaluate sustainability metrics using the Herfindahl-Hirschman Index of resource availability (Gaultois et al., 2013), and predict stability using a statistical oxidation states model (D.W. Davies, Butler, Isayev, & Walsh, 2018).
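As an illustration of the sustainability metric, the Herfindahl-Hirschman Index is conventionally the sum of squared market shares expressed in percent. The sketch below computes it for made-up producer shares and then combines elemental values into a compound value. The function names, the elements "A" and "B", their shares, and the mass-fraction weighting are all assumptions for illustration, not data or code from smact.

```python
def herfindahl_hirschman_index(market_shares_percent):
    """HHI as the sum of squared market shares, with shares given in percent."""
    return sum(share ** 2 for share in market_shares_percent)


def weighted_compound_hhi(element_hhi, mass_fractions):
    """Mass-fraction-weighted average of elemental HHI values (assumed scheme)."""
    total = sum(mass_fractions.values())
    return sum(element_hhi[el] * frac for el, frac in mass_fractions.items()) / total


# Hypothetical producer shares (percent) for two made-up elements "A" and "B".
hhi_a = herfindahl_hirschman_index([50, 30, 20])  # supply concentrated in few producers
hhi_b = herfindahl_hirschman_index([10] * 10)     # supply spread evenly over ten producers

# Hypothetical compound that is 25% A and 75% B by mass.
compound_hhi = weighted_compound_hhi({"A": hhi_a, "B": hhi_b}, {"A": 0.25, "B": 0.75})
print(hhi_a, hhi_b, compound_hhi)
```

A higher value indicates a more concentrated, and therefore more vulnerable, supply, so a screening step could rank or discard candidates above a chosen threshold.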

Core components:
The element and species classes are at the heart of smact. Elements are elements of the periodic table. Species are elements in a particular oxidation state and (optionally) coordination environment. These classes provide access to tabulated data, and the properties of these objects are leveraged by the screening functions. For example, atomic radii can be used in the application of radius-ratio rules (Goldschmidt, 1929) and electronegativities can be used to estimate electronic properties (Nethercot, 1974). In a typical workflow, screening functions are applied to lists of elements or species sets. While other chemistry toolkits such as OpenBabel (O'Boyle et al., 2011), the Atomic Simulation Environment (ASE) (Larsen et al., 2017) and Pymatgen (Ong et al., 2013) can also be used to access tabulated element data, smact is distinctive in that it primarily deals with chemical composition and associated properties, as opposed to molecular or crystal structure.
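The way a screening function can draw on species data may be sketched with a radius-ratio rule. The `Species` dataclass below is a simplified stand-in written for this example and is not smact's class; the radii are approximate Shannon ionic radii entered by hand, and the ratio limits are the usual textbook values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Species:
    """Stand-in for a species: an element in a given oxidation state."""
    symbol: str
    oxidation: int
    radius: float  # ionic radius in angstrom, entered by hand for this sketch


# Textbook radius-ratio limits: (upper bound of r_cation / r_anion, coordination number).
RADIUS_RATIO_LIMITS = [
    (0.155, 2),
    (0.225, 3),
    (0.414, 4),
    (0.732, 6),
    (1.000, 8),
]


def predicted_coordination(cation, anion):
    """Return the cation coordination number suggested by the radius-ratio rule."""
    ratio = cation.radius / anion.radius
    for upper, coordination in RADIUS_RATIO_LIMITS:
        if ratio < upper:
            return coordination
    return 12  # ratio of one or more: close packing


na = Species("Na", 1, 1.02)
cl = Species("Cl", -1, 1.81)
zn = Species("Zn", 2, 0.60)
s = Species("S", -2, 1.84)

print(predicted_coordination(na, cl))  # octahedral (6), as in rock salt
print(predicted_coordination(zn, s))   # tetrahedral (4), as in zinc blende
```

Keeping the tabulated data on the species object means the rule itself stays a few lines long and can be applied to any pair of species.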

High-throughput workflows:
The number of possible element combinations is enormous, exceeding 4 × 10¹² for four-component compounds (D.W. Davies et al., 2016). Because of this, the functions in smact are computationally inexpensive, which allows vast areas of chemical space to be screened rapidly on a desktop computer. This is made possible by (i) a data_loader module that implements a data-caching system to avoid a large amount of I/O and (ii) the use of Python's built-in multiprocessing library, as shown in the example workflows.
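The two ingredients named above, cached data loading and multiprocessing, can be sketched with the standard library alone. This is an assumption-laden illustration rather than smact's data_loader or its example workflows: the function names, the six-element table, and the stoichiometry limit of 4 are all hypothetical.

```python
from functools import lru_cache
from itertools import combinations, product
from multiprocessing import Pool


@lru_cache(maxsize=None)
def load_oxidation_states():
    """Stand-in for reading a data file; the cache means it runs once per process."""
    # Hand-entered illustrative values in place of a tabulated data file.
    return {"Li": (1,), "Mg": (2,), "Al": (3,), "O": (-2,), "F": (-1,), "N": (-3,)}


def has_neutral_assignment(combo):
    """True if some choice of oxidation states and 1..4 stoichiometries is neutral."""
    table = load_oxidation_states()  # served from the cache after the first call
    for charges in product(*(table[el] for el in combo)):
        for ratio in product(range(1, 5), repeat=len(combo)):
            if sum(q * n for q, n in zip(charges, ratio)) == 0:
                return True
    return False


def screen_serial(combos):
    """Apply the filter in the current process."""
    return [c for c in combos if has_neutral_assignment(c)]


def screen_parallel(combos, processes=2):
    """Apply the same filter with the combinations spread over worker processes."""
    combos = list(combos)
    with Pool(processes=processes) as pool:
        flags = pool.map(has_neutral_assignment, combos, chunksize=8)
    return [c for c, keep in zip(combos, flags) if keep]


all_combos = list(combinations(sorted(load_oxidation_states()), 3))
passed = screen_serial(all_combos)
print(len(passed), "of", len(all_combos), "three-element combinations pass")
```

In a script, `screen_parallel(all_combos)` would normally be called from inside an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Each worker fills its own cache on first use, so the table is loaded at most once per process rather than once per combination.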

Interfacing to machine learning:
Materials design is beginning to benefit from the development of powerful machine learning techniques, with many supervised learning models being built to predict important properties (K.T. Butler, Davies, Cartwright, Isayev, & Walsh, 2018). The smact library can provide a large, unseen chemical space to which trained models can be applied. The compositions generated by smact can be featurised using the matminer Python library (Ward et al., 2018) or converted to objects used in Pymatgen.
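To show what featurising a composition involves, the sketch below converts compositions into fixed-length vectors of atomic fractions, the kind of numerical input a trained model expects. It deliberately avoids calling matminer or Pymatgen; the function names, the element ordering, and the two example compositions are assumptions for illustration, and a real workflow would use the featurisers those libraries provide.

```python
def formula(composition):
    """Build a simple formula string, omitting stoichiometric coefficients of 1."""
    return "".join(f"{el}{n if n != 1 else ''}" for el, n in composition.items())


def element_fractions(composition, element_order):
    """Map {symbol: amount} to a fixed-length vector of atomic fractions."""
    total = sum(composition.values())
    return [composition.get(el, 0) / total for el in element_order]


# Fixed column order so that every composition maps to the same feature layout.
ELEMENT_ORDER = ["Na", "Ti", "Cl", "O"]
compositions = [{"Na": 1, "Cl": 1}, {"Ti": 1, "O": 2}]

labels = [formula(c) for c in compositions]
X = [element_fractions(c, ELEMENT_ORDER) for c in compositions]

for label, row in zip(labels, X):
    print(label, row)
```

The resulting rows in `X` could be passed to the `predict` method of any model trained on the same feature layout.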