doped: Python toolkit for robust and repeatable charged defect supercell calculations

Defects are a universal feature of crystalline solids, dictating the key properties and performance of many functional materials. Given their crucial importance yet inherent difficulty in measuring experimentally, computational methods (such as DFT and ML/classical force-fields) are widely used to predict defect behaviour at the atomic level and the resultant impact on macroscopic properties. Here we report doped, a Python package for the generation, pre-/post-processing, and analysis of defect supercell calculations. doped has been built to implement the defect simulation workflow in an efficient and user-friendly -- yet powerful and fully-flexible -- manner, with the goal of providing a robust general-purpose platform for conducting reproducible calculations of solid-state defect properties.

in functional materials and the major advances in computational methodologies and resources facilitating their accurate simulation. Software which enables researchers to efficiently and accurately perform these calculations, while allowing for in-depth targeted analyses of the resultant data, is thus of significant value to the community. Indeed, there are many critical stages in the computational workflow for defects which, when performed manually, not only consume significant researcher time and effort but also leave room for human error, particularly for newcomers to the field. Moreover, there are growing efforts to perform high-throughput investigations of defects in solids (Broberg et al., 2023; Xiong et al., 2023; Yuan et al., 2024), necessitating robust, user-friendly, and efficient software implementing this calculation workflow.
Given this importance of defect simulations and the complexity of the workflow, a number of software packages have been developed with the goal of managing pre- and post-processing of defect calculations, ranging from work on the HADES/METADISE codes from the 1970s (Parker et al., 2004) to more recent work from Kumagai et al. (2021), Broberg et al. (2018), Shen & Varley (2024), Neilson & Murphy (2022), Arrigoni & Madsen (2021), Goyal et al. (2017), M. Huang et al. (2022), Péan et al. (2017) and Naik & Jain (2018). While each of these codes has its strengths, they do not include the full suite of functionality provided by doped (some of which is discussed below), nor do they adopt the same focus on user-friendliness (along with sanity-checking warnings and error catching) and efficiency with full flexibility and wide-ranging functionality, targeting expert-level users and newcomers to the field alike.

doped

doped is a Python package for the generation, pre-/post-processing, and analysis of defect supercell calculations, as depicted in Figure 1. The design philosophy of doped has been to implement the defect simulation workflow in an efficient, reproducible, and user-friendly (yet powerful and fully-customisable) manner, combining reasonable defaults with full user control for each parameter in the workflow. As depicted in Figure 1, the core functionality of doped is the generation of defect supercells and competing phases, writing calculation input files, parsing calculation outputs, and analysing/plotting defect-related properties. This functionality and the recommended usage of doped are demonstrated in the tutorials on the documentation website.
Some key advances of doped include:
• Supercell Generation: When choosing a simulation supercell for charged defects in materials, we typically want to maximise the minimum distance between periodic images of the defect (to reduce finite-size errors) while keeping the supercell to a tractable number of atoms/electrons to calculate.
Common approaches are to choose a near-cubic integer expansion of the unit cell (Ong et al., 2013), or to use a cell shape metric to search for optimal supercells (Larsen et al., 2017). Building on these and instead integrating an efficient algorithm for calculating minimum image distances, doped directly optimises the supercell choice for this goal, often identifying non-trivial 'root 2'/'root 3' type supercells. As illustrated in Figure 2a, this leads to a significant reduction in the supercell size (and thus computational cost) required to achieve a threshold minimum image distance.
- Over a test set of simple cubic, trigonal, orthorhombic, monoclinic and face-centred cubic unit cells, the doped algorithm is found to give mean improvements of 35.2%, 9.1% and 6.7% in the minimum image distance for a given (maximum) number of unit cells, in the range of 2-20 unit cells, as compared to the pymatgen cubic supercell algorithm, the ASE optimal cell shape algorithm with simple-cubic target shape, and ASE with FCC target shape respectively. For 2-50 unit cells (for which the mean values across this test set are plotted in Figure 2a), this becomes 36.0%, 9.3% and 5.6% respectively. Given the approximately cubic scaling of DFT computational cost with the number of atoms, these correspond to significant reductions in cost (~20-150%).
- As always, the user has full control over supercell generation in doped, with the ability to specify/adjust constraints on the minimum image distance, number of atoms or transformation matrix, or to simply provide a pre-generated supercell if desired.
• Charge-state Estimation: Defects in solids can adopt various electronic charge states. However, the set of stable charge states for a given defect is typically not known a priori, so one must choose a set of possible defect charge states to calculate, usually relying on some form of chemical intuition. In this regard, extremal defect charge states that are calculated but do not end up being stable can be considered 'false positives' or 'wasted' calculations, while charge states which are stable but were not calculated can be considered 'false negatives' or 'missed' calculations. doped builds on other routines which use known elemental oxidation states to additionally account for oxidation state probabilities, the electronic state of the host crystal and charge state magnitudes. Implementing these features in a simple cost function, we find a significant improvement in terms of both efficiency (reduced false positives) and completeness (reduced false negatives) for this charge state estimation, as shown in Figure 2b. Again, this step is fully customisable. The user can tune the probability threshold at which to include charge states or manually specify defect charge states. All probability factors computed are available to the user and saved to the defect JSON files for full reproducibility.
• Efficient Competing Phase Selection: Elemental chemical potentials (a key term in the defect formation energy) are limited by the secondary phases which border the host compound on the phase diagram. These bordering phases are known as competing phases, and their total energies must be calculated to determine the chemical potential limits. Only the elemental reference phases and compounds which border the host on the phase diagram need to be calculated, rather than the full phase diagram.
doped aims to improve the efficiency of this step by querying the Materials Project database (containing both experimentally-measured and theoretically-predicted crystal structures), and pulling only compounds which could border the host material within a user-specified error tolerance for the semi-local DFT database energies (0.1 eV/atom by default), along with the elemental reference phases. The necessary k-point convergence step for these compounds is also implemented in a semi-automated fashion to expedite this process.
- With the parsed chemical potentials in doped, the user can easily select various X-poor/rich chemical conditions, or scan over a range of chemical potentials (growth conditions) as shown in Figure 2e,h.
• Automated Symmetry & Degeneracy Handling: doped automatically determines the point symmetry of both initial (un-relaxed) and final (relaxed) defect configurations, and computes the corresponding orientational (and spin) degeneracy factors. This functionality is also offered in the form of standalone functions which do not require the defect calculations to have been generated/parsed with doped. The degeneracy is a key pre-factor in the defect concentration equation:

$$N_D = g N_s \exp\left(-\frac{E_f}{k_B T}\right)$$

where g is the product of all degeneracy factors, N_s is the concentration of lattice sites for that defect, E_f is the defect formation energy, N_D is the defect concentration, k_B is the Boltzmann constant and T is the temperature. g can affect predicted defect/carrier concentrations by up to two or three orders of magnitude (Kavanagh, Scanlon, et al., 2022; Mosquera-Lois, Kavanagh, Klarbring, et al., 2023), and is often overlooked in defect calculations, partly due to the (previous) requirement of significant manual effort and knowledge of group theory.
• Automated Compatibility Checking: When parsing defect calculations, doped automatically checks that calculation parameters which could affect the defect formation energy (e.g. k-point grid, energy cutoff, pseudopotential choice, exchange fraction, Hubbard U etc.) are consistent between the defect and reference calculations. Inconsistency in these parameters is a common source of accidental error in defect calculations, and doped provides informative warnings if any inconsistencies are detected.
• Thermodynamic Analysis: doped provides a suite of flexible tools for the analysis of defect thermodynamics, including formation energy diagrams (Figure 2d), equilibrium & non-equilibrium Fermi level solving (Figure 2f), doping analysis (Figure 2g,h), Brouwer-type diagrams etc. These include physically-motivated (but tunable) grouping of defect sites, full inclusion of metastable states, support for complex system constraints, optimisation over high-dimensional chemical & temperature space and highly customisable plotting. In-depth examples are provided in the tutorials.
• Reproducibility & Tabulation: doped has been built to support and encourage reproducibility, with all input parameters and calculation results saved to lightweight JSON files. This allows for easy sharing of calculation inputs/outputs and reproducible analysis. Several tabulation functions are also provided to facilitate the quick summarising of key quantities as exemplified in the tutorials (including defect formation energy contributions, charge transition levels (with/without metastable states), symmetry, degeneracy and multiplicity factors, defect/carrier concentrations, chemical potential limits, dopability limits, doping windows, etc.) to aid transparency, reproducibility, comparisons with other works, and general analysis. The use of these tabulated outputs in supporting information of publications is encouraged.
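To make the supercell criterion described under Supercell Generation concrete, the following is a minimal sketch of how a minimum periodic image distance can be computed and then maximised by brute force. It is not the doped implementation or API: the function names `min_image_distance` and `best_supercell`, the `search_range` and `max_entry` parameters, and the exhaustive search itself are assumptions made for illustration. The finite search windows are adequate only for small, weakly skewed cells.

```python
import itertools

import numpy as np


def min_image_distance(lattice, search_range=3):
    """Length of the shortest non-zero lattice vector (rows of `lattice` are cell vectors).

    Assumption: integer combinations within [-search_range, search_range] are
    enough to find the shortest vector, which holds for weakly skewed cells.
    """
    lattice = np.asarray(lattice, dtype=float)
    coeffs = np.array(
        [c for c in itertools.product(range(-search_range, search_range + 1), repeat=3) if any(c)]
    )
    return float(np.linalg.norm(coeffs @ lattice, axis=1).min())


def best_supercell(unit_lattice, max_cells, max_entry=1):
    """Hypothetical brute-force search over integer transformation matrices.

    Tries every 3x3 matrix with entries in [-max_entry, max_entry] whose
    determinant (number of unit cells) lies in 1..max_cells, and keeps the
    one with the largest minimum image distance.
    """
    unit_lattice = np.asarray(unit_lattice, dtype=float)
    best_matrix, best_distance = None, 0.0
    entries = range(-max_entry, max_entry + 1)
    for flat in itertools.product(entries, repeat=9):
        matrix = np.array(flat).reshape(3, 3)
        n_cells = int(round(np.linalg.det(matrix)))
        if not 0 < n_cells <= max_cells:
            continue  # skip singular, left-handed, or too-large supercells
        distance = min_image_distance(matrix @ unit_lattice)
        if distance > best_distance + 1e-9:
            best_matrix, best_distance = matrix, distance
    return best_matrix, best_distance


if __name__ == "__main__":
    cubic = np.eye(3)  # simple cubic unit cell with unit lattice parameter
    diagonal = np.diag([1, 1, 2])  # conventional 1x1x2 expansion (2 unit cells)
    print("1x1x2 expansion:", min_image_distance(diagonal @ cubic))
    matrix, distance = best_supercell(cubic, max_cells=2)
    print("best 2-cell supercell:", distance)
    print(matrix)
```

For a simple cubic cell limited to two unit cells, the diagonal 1x1x2 expansion has a minimum image distance of one lattice parameter, whereas the search finds a non-diagonal 'root 2' type supercell with a minimum image distance of √2 lattice parameters for the same number of atoms.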
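The effect of the degeneracy pre-factor g discussed under Automated Symmetry & Degeneracy Handling can be illustrated with a short sketch of the dilute-limit concentration expression, N_D = g N_s exp(-E_f / k_B T). The helper name `defect_concentration` and the example numbers are hypothetical and are not part of the doped API; the sketch assumes non-interacting defects and a fixed formation energy.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def defect_concentration(formation_energy_ev, site_concentration, degeneracy=1.0, temperature_k=300.0):
    """Dilute-limit defect concentration N_D = g * N_s * exp(-E_f / (k_B * T)).

    `site_concentration` (N_s) sets the units of the result, e.g. cm^-3.
    `degeneracy` (g) is the product of all degeneracy factors.
    """
    boltzmann_factor = math.exp(-formation_energy_ev / (K_B_EV_PER_K * temperature_k))
    return degeneracy * site_concentration * boltzmann_factor


if __name__ == "__main__":
    # Illustrative (made-up) inputs: E_f = 1.0 eV, N_s = 1e22 cm^-3, T = 1000 K
    without_g = defect_concentration(1.0, 1e22, degeneracy=1.0, temperature_k=1000.0)
    with_g = defect_concentration(1.0, 1e22, degeneracy=12.0, temperature_k=1000.0)
    print(f"g = 1:  {without_g:.3e} cm^-3")
    print(f"g = 12: {with_g:.3e} cm^-3")
```

Because g enters as a linear pre-factor, neglecting a combined degeneracy of 12 underestimates the concentration by exactly that factor at any temperature.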

CRediT Author Contributions
Seán R.

Figure 1: Schematic workflow of a computational defect investigation using doped.

Figure 2: Performance and example outputs from doped. (a) Average minimum periodic image distance, normalised by the ideal image distance (i.e. for a close-packed face-centred cubic (FCC) cell), vs. number of unit cells for supercell generation algorithms in doped, ASE, and pymatgen. "SC" = simple cubic and "HCP" = hexagonal close-packed. (b) Average performance of various charge state estimation routines. "ICSD probabilities" refers to a model based on oxidation state probabilities, as given by their occurrence in the ICSD database. Asterisk indicates that pyCDT "false negatives" are underestimated as the majority of this test set used the pyCDT charge state ranges. "Ox. state" = oxidation state. Example (c) Kumagai-Oba (eFNV) finite-size correction plot, (d) defect formation energy diagram, (e) chemical potential / stability region, (f) Fermi level vs. annealing temperature, (g) defect/carrier concentrations vs. annealing temperature and (h) Fermi level / carrier concentration heatmap plots from doped. Automated plots of single-particle eigenvalues from DFT supercell calculations for (i) V_Cu^0 in Cu2SiSe3 and (j) V_Cd^-1 in CdTe. (k) Automated site displacement analysis, plotting atomic displacements with respect to the defect site against distance to the defect site, for V_Cd^-1 in CdTe. Data and code to reproduce these plots are provided in the docs/JOSS folder of the doped GitHub repository.