Foundry-ML - Software and Services to Simplify Access to Machine Learning Datasets in Materials Science



License
Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Summary
The application of open science and machine learning to scientific, engineering, and industry-relevant problems is a critical component of the cross-department U.S. Artificial Intelligence (AI) strategy, highlighted, e.g., by the AI Initiative, the recent National AI Strategy report ("Strengthening and Democratizing the U.S. Artificial Intelligence Innovation Ecosystem: An Implementation Plan for a National Artificial Intelligence Research Resource," 2023), the Year of Open Data, the Materials Genome Initiative (Pablo et al., 2019; Ward & Warren, 2015), and more. A key aspect of these strategies is to ensure that infrastructure exists to make datasets easily accessible for training, retraining, reproducing, and verifying model performance on chosen tasks. However, the discovery of high-quality, curated datasets adhering to the FAIR principles (findable, accessible, interoperable, and reusable) remains a challenge.
To overcome these dataset access challenges, we introduce Foundry-ML, software that combines several services to give researchers the ability to publish and discover structured datasets for ML in science, specifically in materials science and chemistry. Foundry-ML consists of a Python client, a web app, and standardized metadata and file structures built using services including the Materials Data Facility (Blaiszik et al., 2016, 2019) and Globus (Ananthakrishnan et al., 2018; Chard et al., 2015). Together, these services work in conjunction with Python software tooling to dramatically simplify data access patterns, as we show below.

Statement of need
The processes by which high-quality structured science datasets are published and accessed remain decentralized, scattered, and without shared standards, with some exceptions (e.g., Wu et al. (2018)). With Foundry-ML, we provide 1) a simple Python interface that allows users to access structured ML-ready materials science and chemistry datasets with just a few lines of code, 2) a prototype web-based interface for dataset search and discovery, and 3) software that enables users to publish their own ML-ready datasets in a self-service manner.
Foundry-ML focuses foremost on accessibility and reproducibility. Figure 1 shows an example of how, with just a few lines of code, researchers can access a curated collection of ML-ready datasets, the associated metadata describing the dataset contents, split details (e.g., train, test, validate), and other information (e.g., number of entries). As of Q1 2023, we have collected and made available 30 datasets in Foundry with data representations including tabular data (e.g., csv, Excel), key-value data (e.g., JSON), image sets, and hierarchical data (e.g., HDF5). Foundry-ML is built on established infrastructure: we have developed it using the Materials Data Facility (MDF) (Blaiszik et al., 2016, 2019) and Globus services such as Auth, Transfer, and Search. Foundry-ML users can upload large datasets (MDF supports multi-TB databases, with potentially millions of files), making them easy for the rest of the scientific community to share, use, and discover. All datasets are made available through the Foundry-ML software, the Foundry-ML web app, and Globus endpoints that support both Globus and HTTPS access.
Beyond simplified data access, enhanced interpretability is a key feature of Foundry-ML. Foundry-ML datasets have required metadata (see Figure 1b) that are provided by the authors of each dataset. All metadata are stored in Globus Search (Chard et al., 2015) to facilitate queries. To make these metadata easily usable by Foundry-ML users, query helpers are provided via the Foundry-ML Python client to perform common actions, e.g., listing all datasets, selecting datasets by DOI, and more.
In addition to the Python software interface to each dataset, we have developed a prototype web interface (Figure 2) that lists all datasets with instructions on how to access them and key features of each dataset (e.g., number of entries, inputs, targets, type of data, tags, free text description). While the examples presented here come from the domains of materials science and chemistry, Foundry-ML is designed to be domain agnostic. Because similar problems exist in other scientific domains, we expect the same software and services to generalize and help solve them.
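The required metadata and query helpers described above can be illustrated with a small, self-contained sketch. It does not use the Foundry-ML client or Globus Search. The record fields (`doi`, `title`, `keys`, `splits`, `n_items`), the helper names `list_datasets` and `get_by_doi`, and the catalog entries, including the DOIs and units, are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyInfo:
    """Describes one column (key) of a dataset."""
    name: str
    role: str               # "input" or "target"
    units: str = ""
    description: str = ""


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; NOT the actual Foundry-ML schema."""
    doi: str
    title: str
    keys: tuple
    splits: tuple
    n_items: int


# Placeholder catalog; DOIs, titles, and values are invented for illustration.
CATALOG = (
    DatasetRecord(
        doi="10.0000/example-zeolite",
        title="Example zeolite dataset",
        keys=(
            KeyInfo("smiles", "input", description="molecule string"),
            KeyInfo("binding_energy", "target", units="kJ/mol"),
        ),
        splits=("train", "test"),
        n_items=1000,
    ),
    DatasetRecord(
        doi="10.0000/example-bandgap",
        title="Example band gap dataset",
        keys=(
            KeyInfo("composition", "input"),
            KeyInfo("band_gap", "target", units="eV"),
        ),
        splits=("train",),
        n_items=250,
    ),
)


def list_datasets(catalog=CATALOG):
    """Return (doi, title) pairs for every record in the catalog."""
    return [(record.doi, record.title) for record in catalog]


def get_by_doi(doi, catalog=CATALOG):
    """Return the record whose DOI matches exactly, or None if absent."""
    for record in catalog:
        if record.doi == doi:
            return record
    return None
```

Keeping units and roles next to each key is what lets a reader interpret a column without opening the data files; a production system would back these lookups with a search index rather than a linear scan.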

Usage
Foundry has been successfully used in educational curricula (Stan et al., 2021) and to publish datasets by research teams at the University of Chicago, Argonne National Lab, the University of Toronto (Huang et al., 2022), 3M (Schneider et al., 2022), the University of Wisconsin (Li et al., 2021; Wei et al., 2021), MIT (Schwalbe-Koda et al., 2021), and many more. In Figure 1, we highlight a use case for the ML-guided design of organic structure-directing agents (OSDAs) to promote zeolite formation from the team of Gomez-Bombarelli at MIT. By using only the Foundry-ML software and the dataset DOI (Figure 1a), which could be cited in a paper or retrieved from the Foundry-ML web app or software, a researcher can load descriptive metadata (Figure 1b) to understand the dataset contents, and load the data (Figure 1c) for analysis, exploration, and replication. A notebook showcasing this use case is available in the GitHub examples linked in the Documentation section below.
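The metadata-then-data workflow can be mimicked without network access. The sketch below is a stand-in under stated assumptions, not the Foundry-ML implementation: the metadata dictionary layout, the `load_split` helper, the column names, and the inline CSV text are all hypothetical. The caption of Figure 1 states that the real client returns a Pandas or Dask dataframe for tabular data, whereas this version returns plain lists so that it depends only on the standard library.

```python
import csv
import io

# Hypothetical metadata: which columns are inputs and which are targets.
METADATA = {
    "keys": [
        {"key": "temperature", "type": "input", "units": "K"},
        {"key": "pressure", "type": "input", "units": "bar"},
        {"key": "yield", "type": "target", "units": "%"},
    ],
    "splits": ["train", "test"],
}

# Inline stand-in for the files a published dataset would contain.
SPLIT_FILES = {
    "train": "temperature,pressure,yield\n300,1.0,42.5\n350,2.0,55.0\n",
    "test": "temperature,pressure,yield\n325,1.5,48.0\n",
}


def load_split(split, metadata=METADATA, files=SPLIT_FILES):
    """Return (inputs, targets) for one split as lists of float rows."""
    if split not in metadata["splits"]:
        raise KeyError(f"unknown split: {split!r}")
    # Use the metadata, not the file header, to decide each column's role.
    input_cols = [k["key"] for k in metadata["keys"] if k["type"] == "input"]
    target_cols = [k["key"] for k in metadata["keys"] if k["type"] == "target"]
    inputs, targets = [], []
    for row in csv.DictReader(io.StringIO(files[split])):
        inputs.append([float(row[col]) for col in input_cols])
        targets.append([float(row[col]) for col in target_cols])
    return inputs, targets


X_train, y_train = load_split("train")
X_test, y_test = load_split("test")
```

Driving the input/target separation from the metadata rather than from hard-coded column positions is the design choice this sketch is meant to show: the same loader works for any dataset whose metadata follows the same layout.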

Future Directions
In future work, we intend to add capabilities to Foundry-ML that enable publication and connection of datasets with ML models, creating a combined ecosystem of datasets and models. This work will be completed in collaboration between two National Science Foundation (NSF) projects: (#1931306) "Collaborative Research: Framework: Machine Learning Materials Innovation Infrastructure" and (#2209892) "Garden: A FAIR Framework for Publishing and Applying AI Models for Translational Research in Science, Engineering, Education, and Industry".

Figure 1: A Foundry-ML use case for zeolite design. (a) A user instantiates the Foundry-ML Python client and loads the descriptive metadata using the DOI. (b) Descriptive metadata includes information about the keys included in the datasets, associated units, and a short description. The metadata also include information about the dataset, including the associated splits (e.g., train, test, validate) and the amount of data included. (c) A user can then load the data using the load_data function. This function returns a Pandas or Dask dataframe for tabular data. The zeolite dataset shown here, its metadata, and the data itself are from researchers Daniel Schwalbe-Koda and Rafael Gomez-Bombarelli.

Figure 2: Foundry website UI for browsing datasets. This figure shows a web user interface for browsing available datasets with summary information about the datasets.