DeepRank2: Mining 3D Protein Structures with Geometric Deep Learning

DeepRank2, a deep

Figure 1: DeepRank2 framework overview. 3D coordinates of protein structures are extracted from PDB files and converted into graphs at either the atomic or the residue level, depending on the user's requirements. The data are then enriched with geometrical and physico-chemical information and, optionally, mapped to 3D grids before being stored in HDF5 files. The processed data can be used in the pre-implemented DL pipeline for training PyTorch networks and computing predictions.

State of the field
The 3D structure of proteins and protein complexes provides fundamental information for understanding biological processes at the molecular scale. Exploiting or engineering these molecules is key for many biomedical applications such as drug design (Gane & Dean, 2000), immunotherapy (Sadelain et al., 2013), or designing novel proteins (Liu et al., 2007). For example, data on protein-protein interfaces (PPIs) can be harnessed to address critical challenges in the computational prediction of peptides presented on the major histocompatibility complex (MHC) protein, which play a key role in T-cell immunity. Protein structures can also be exploited in molecular diagnostics for the identification of single-residue variants (SRVs), which can be pathogenic sequence alterations in patients with inherited diseases (B. Li et al., 2020; Shroff et al., 2020).
In the past decades, a variety of experimental methods (e.g., X-ray crystallography, nuclear magnetic resonance, cryogenic electron microscopy) have determined and accumulated a large number of atomic-resolution 3D structures of proteins and protein-protein complexes (Schwede, 2013). Since experimental determination of structures is a tedious and expensive process, several computational prediction methods have been developed over the same period, exploiting classical molecular modelling (Baek et al., 2021; Dominguez et al., 2003; Sanchez & Sali, 1997) and, more recently, deep learning (DL) (Jumper et al., 2021; Richard Evans, 2021). The large amount of data available makes it possible to use DL to leverage 3D structures and learn their complex patterns. Unlike other machine learning (ML) techniques, deep neural networks hold the promise of learning from millions of data points without quickly reaching a performance plateau, which is made computationally feasible by hardware accelerators (e.g., GPUs and TPUs) and parallel file system technologies.
The main data structures used for representing 3D structures are 3D grids, graphs, and surfaces. 3D convolutional neural networks (CNNs) have been trained on 3D grids for the classification of biological vs. crystallographic PPIs (Renaud et al., 2021), and for the scoring of models of protein-protein complexes generated by computational docking (Renaud et al., 2021; Wang et al., 2020). Gainza et al. have applied geodesic CNNs to extract protein interaction fingerprints by applying 2D CNNs on spread-out protein surface patches (Gainza et al., 2023). 3D CNNs have also been used to exploit protein structure data for predicting mutation-induced changes in protein stability (B. Li et al., 2020; Ramakrishnan et al., 2023) and for identifying novel gain-of-function mutations (Shroff et al., 2020). In contrast to CNNs, the convolution operations in graph neural networks (GNNs) can rely on the relative local connectivity between nodes rather than on the orientation of the data, which makes graph representations rotationally invariant. Additionally, GNNs can accept graphs of any size, whereas in a CNN the 3D grid must have the same size for all input data, which may be problematic for datasets containing structures of highly variable size. Based on these arguments, different GNN-based tools have been designed to predict patterns from PPIs (Fout et al., 2017; Réau et al., 2022; Wang et al., 2021). Eismann et al. developed a rotation-equivariant neural network trained on a point-based representation of the protein atomic structure to classify PPIs (Eismann et al., 2021).
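The rotational-invariance argument can be illustrated with a minimal sketch. It assumes toy coordinates and a simple occupancy grid (both hypothetical, not taken from any of the cited tools): inter-point distances, which are what a distance-based graph encodes, survive a rotation unchanged, while the set of occupied voxels does not.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def pairwise_distances(points):
    """All inter-point distances in fixed pair order (orientation-free)."""
    n = len(points)
    return [math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]

def occupied_voxels(points, voxel_size=1.0):
    """Grid cells containing at least one point (orientation-dependent)."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}

# Hypothetical toy coordinates (in Å), not taken from a real structure.
coords = [(0.2, 0.1, 0.0), (3.8, 0.3, 0.5), (5.1, 3.9, 1.2), (1.4, 4.6, 2.3)]
rotated = [rotate_z(p, math.radians(90)) for p in coords]

# Distances agree up to floating-point noise; voxel occupancy differs.
same_distances = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(coords), pairwise_distances(rotated))
)
same_voxels = occupied_voxels(coords) == occupied_voxels(rotated)
print(same_distances, same_voxels)
```

A grid-based model therefore typically needs rotational data augmentation or a canonical orientation, whereas a model that only consumes distances and connectivity does not.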

Statement of need
Data mining 3D structures of proteins presents several challenges. These include complex physico-chemical rules governing structural features, the possibility of characterization at different scales (e.g., atom level, residue level, and secondary structure level), and the large diversity in shape and size. Furthermore, because a structure can easily comprise hundreds to thousands of residues (and ~15 times as many atoms), efficient processing and featurization of many structures is critical to keep the computational cost and file storage requirements manageable. Existing software solutions are often highly specialized rather than developed as reusable and flexible frameworks, and cannot be easily adapted to diverse applications and predictive tasks. Examples include DeepAtom (Y. Li et al., 2019), which targets protein-ligand binding affinity prediction only, and MaSIF (Gainza et al., 2023), which targets deciphering patterns in protein surfaces. While some frameworks, such as TorchProtein and TorchDrug (Zhu et al., 2022), position themselves as general-purpose ML libraries for both molecular sequences and 3D structures, they only implement geometry-related features and do not incorporate fundamental physico-chemical information in the 3D representation of molecules.
These limitations create a growing demand for a generic and flexible DL framework that researchers can readily utilize for their specific research questions while cutting down the tedious data preprocessing stages. Generic DL frameworks have already emerged in diverse scientific fields, such as computational chemistry (e.g., DeepChem (Ramsundar et al., 2019)) and condensed matter physics (e.g., NetKet (Vicentini et al., 2022)), which have promoted collaborative efforts, facilitated novel insights, and benefited from continuous improvements and maintenance by engaged user communities.

Key features
DeepRank2 allows users to transform 3D representations of both PPIs and SRVs into 3D grids or graphs containing both geometric and physico-chemical information and to store them, and it provides a DL pipeline for training pre-implemented neural networks on a pattern of interest to the user. DeepRank2 is an improved and unified version of three previously developed packages: DeepRank, DeepRank-GNN, and DeepRank-Mut.
As input, DeepRank2 takes atomic structures in PDB format, one of the standard and most widely used formats in structural biology. These are mapped to graphs, where nodes can represent either residues or atoms, as chosen by the user, and edges represent the interactions between them. The user can configure two types of 3D structures as input for the featurization phase:
• PPIs, for mining interaction patterns within protein-protein complexes;
• SRVs, for mining mutation phenotypes within protein structures.
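The mapping from PDB records to a residue-level graph can be sketched as follows. This is a minimal illustration using only the Python standard library, not DeepRank2's own implementation: the helper names, the toy coordinates, and the 5 Å distance cutoff used to define edges are all assumptions made for the example.

```python
import math
from collections import defaultdict

def atom_record(serial, name, resname, chain, resseq, x, y, z):
    """Format one fixed-width PDB ATOM record (coordinates in columns 31-54)."""
    return (f"ATOM  {serial:>5d} {name:<4s} {resname:>3s} {chain}{resseq:>4d}    "
            f"{x:>8.3f}{y:>8.3f}{z:>8.3f}")

def parse_atoms(pdb_text):
    """Read residue name, chain, residue number and coordinates from ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atoms.append({
                "resname": line[17:20].strip(),
                "chain": line[21],
                "resseq": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms

def residue_graph(atoms, cutoff=5.0):
    """Nodes are residues; an edge links residues whose closest atoms lie within `cutoff` Å."""
    residues = defaultdict(list)
    for atom in atoms:
        residues[(atom["chain"], atom["resseq"], atom["resname"])].append(atom["xyz"])
    nodes = sorted(residues)
    edges = {}
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dmin = min(math.dist(p, q) for p in residues[a] for q in residues[b])
            if dmin <= cutoff:
                edges[(a, b)] = dmin  # minimum inter-atomic distance as edge feature
    return nodes, edges

# Hypothetical toy complex: two residues on chain A, two on chain B.
pdb_text = "\n".join([
    atom_record(1, "CA", "ALA", "A", 1, 0.0, 0.0, 0.0),
    atom_record(2, "CB", "ALA", "A", 1, 1.5, 0.0, 0.0),
    atom_record(3, "CA", "GLY", "A", 2, 3.8, 0.0, 0.0),
    atom_record(4, "CA", "SER", "B", 10, 1.5, 4.0, 0.0),
    atom_record(5, "CA", "LYS", "B", 11, 20.0, 20.0, 0.0),
])
nodes, edges = residue_graph(parse_atoms(pdb_text))
# Edges that cross chains correspond to the protein-protein interface.
interface_edges = [pair for pair in edges if pair[0][0] != pair[1][0]]
print(len(nodes), len(edges), len(interface_edges))
```

Switching to an atom-level graph would only change the grouping key: each atom becomes its own node instead of being pooled into its residue.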
The physico-chemical and geometrical features are then computed and assigned to each node and edge. The user can choose which features to generate from several pre-existing options defined in the package, or define custom feature modules, as explained in the documentation. Examples of pre-defined node features are the type of the amino acid, its size and polarity, as well as more complex features such as its buried surface area and secondary structure features. Examples of pre-defined edge features are distance, covalency, and potential energy. A detailed list of pre-defined features can be found in the documentation's features page. Graphs can either be used directly or mapped to volumetric grids (i.e., 3D image-like representations), together with their features. Multiple CPUs can be used to parallelize and speed up the featurization process. The processed data are saved into HDF5 files, a format designed to efficiently store and organize big data. Users can then use the data with any ML or DL framework suited to the application. Specifically, graphs can be used for the training of GNNs, and 3D grids can be used for the training of CNNs.
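To make the graph-to-grid mapping concrete, the sketch below spreads a scalar node feature over a small cubic grid with a Gaussian kernel. It is an illustration under stated assumptions rather than the package's implementation: the function name, grid size, spacing, kernel width, and node values are all hypothetical.

```python
import math

def map_to_grid(nodes, center, points_per_axis=5, spacing=2.0, sigma=1.5):
    """Spread each node's feature value over a cubic grid with a Gaussian kernel.

    `nodes` is a list of (position, value) pairs; the result is a nested list
    grid[i][j][k] holding the summed contributions at each grid point.
    """
    half = (points_per_axis - 1) / 2
    axis = [(i - half) * spacing for i in range(points_per_axis)]
    grid = [[[0.0] * points_per_axis for _ in axis] for _ in axis]
    for i, gx in enumerate(axis):
        for j, gy in enumerate(axis):
            for k, gz in enumerate(axis):
                voxel = (center[0] + gx, center[1] + gy, center[2] + gz)
                for position, value in nodes:
                    d2 = sum((a - b) ** 2 for a, b in zip(voxel, position))
                    grid[i][j][k] += value * math.exp(-d2 / (2 * sigma ** 2))
    return grid

# Hypothetical nodes: (position in Å, scalar feature such as a partial charge).
nodes = [((0.0, 0.0, 0.0), 1.0), ((4.0, 0.0, 0.0), -0.5)]
grid = map_to_grid(nodes, center=(0.0, 0.0, 0.0))

# The grid point at the origin is dominated by the first node,
# the grid point at (4, 0, 0) by the second.
print(round(grid[2][2][2], 4), round(grid[4][2][2], 4))
```

Each feature yields one such volume, so stacking the volumes of all selected features gives a multi-channel 3D image of fixed shape, which is the input form a 3D CNN expects.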
DeepRank2 also provides convenient pre-implemented modules for training simple PyTorch-based GNNs and CNNs using the data generated in the previous step. Alternatively, users can implement custom PyTorch networks in the DeepRank package (or export the data to external software). Data can be loaded across multiple CPUs, and the training can be run on GPUs. The data stored within the HDF5 files are read into customized datasets, and the user-friendly API allows for selection of individual features (from those generated above) and definition of the targets and of the predictive task (classification or regression), among other settings. The datasets can then be used for training, validating, and testing the chosen neural network. The final model and results can be saved using built-in data exporter modules.
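The dataset preparation described above (feature selection, target definition, and a train/validation/test partition) can be sketched in a framework-independent way. The entry layout, feature names, target name, and split fractions below are hypothetical and do not reflect DeepRank2's actual API.

```python
import random

def select_features(entry, feature_names, target_name):
    """Keep only the requested features and the chosen target from one stored entry."""
    inputs = [entry["features"][name] for name in feature_names]
    return inputs, entry["targets"][target_name]

def split_entries(entry_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle entry ids reproducibly and cut them into train/validation/test subsets."""
    ids = sorted(entry_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_valid = int(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]

# Hypothetical processed entries, standing in for records read from HDF5 files.
entries = {
    f"complex_{i:02d}": {
        "features": {"polarity": i % 3, "buried_surface": 10.0 * i, "distance": 0.5 * i},
        "targets": {"binary": i % 2, "score": 0.1 * i},
    }
    for i in range(20)
}

train_ids, valid_ids, test_ids = split_entries(entries)

# Classification task: a binary target with two selected input features.
inputs, target = select_features(
    entries[train_ids[0]], ["polarity", "distance"], target_name="binary"
)
print(len(train_ids), len(valid_ids), len(test_ids), inputs, target)
```

Choosing `"score"` instead of `"binary"` as the target would turn the same data into a regression task; fixing the seed keeps the partition reproducible across runs.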

DeepRank2 embraces the best practices of open-source development by using Git and GitHub, unit testing (as of August 2023, coverage is 83%), continuous integration, automatic documentation, and the Findable, Accessible, Interoperable, and Reusable (FAIR) principles. Detailed documentation and tutorials for getting started with the package are publicly available. The project aims to create high-quality software that can be easily accessed, used, and contributed to by a wide range of researchers.
We believe this project will have a positive impact across structural bioinformatics, enabling advancements in areas that rely on the analysis of molecular complexes, such as structural biology, protein engineering, and rational drug design. The target community includes researchers working with data on molecular complexes, such as computational biologists, immunologists, and structural bioinformaticians. The existing features, together with the sustainable package formatting and the modular design, make DeepRank2 an excellent framework to build upon. Taken together, DeepRank2 provides all the requirements to become the all-purpose DL tool that is currently lacking in the field of biomolecular interactions.