The pdb2sql Python Package: Parsing, Manipulation and Analysis of PDB Files Using SQL Queries


Summary
The analysis of biomolecular structures is a crucial task for a wide range of applications, from drug design to protein engineering. The Protein Data Bank (PDB) file format (Burley et al., 2019) is the most popular format for describing biomolecular structures such as proteins and nucleic acids. In this text-based format, each line represents a single atom and records its main properties, such as atom name and identifier, residue name and identifier, chain identifier, and coordinates. Several solutions have been developed to parse PDB files into dedicated objects that facilitate the analysis and manipulation of biomolecular structures. This is, for example, the case for the BioPython parser (Cock et al., 2009; @biopdb), which loads PDB files into a nested dictionary whose structure mimics the hierarchical nature of the biomolecular structure. Selecting a given sub-part of the biomolecule can then be done by traversing the dictionary and selecting the required atoms. Other packages, such as ProDy (Bakan, Meireles, & Bahar, 2011), BioJava (Lafita, 2019), MMTK (Hinsen, 2000) and MDAnalysis (Gowers et al., 2016), to cite a few, also offer solutions to parse PDB files. However, these parsers are embedded in large codebases that are sometimes difficult to integrate with new applications and are often geared toward the analysis of molecular dynamics simulations. Lightweight applications such as pdb-tools (Rodrigues, Teixeira, Trellet, & Bonvin, 2018) lack the capabilities to manipulate coordinates.
We present here the Python package pdb2sql, which loads individual PDB files into a relational database. The Structured Query Language (SQL) is a very popular way to query such a database. However, SQL queries can be complex, and domain scientists such as bioinformaticians are usually not familiar with them. This represents an important barrier to the adoption of SQL technology in bioinformatics. pdb2sql exposes complex SQL queries through simple Python methods that are intuitive for end users. As such, our package leverages the power of SQL queries while removing the barrier that SQL complexity represents. In addition, several advanced modules have been built, for example, to rotate or translate biomolecular structures, to characterize interface contacts, and to measure structure similarity between two protein complexes. Additional modules can easily be developed following the same scheme. As a consequence, pdb2sql is a lightweight and versatile PDB tool that is easy to extend and to integrate with new applications.
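To make the idea of a PDB file stored as a relational table concrete, the following minimal sketch parses two hand-written ATOM records with Python's standard sqlite3 module. It is an illustration under stated assumptions, not the pdb2sql implementation: the table name ATOM, the reduced set of columns, and the helper names parse_atom_line and load_atoms are chosen here for illustration only. The slices follow the fixed-column layout of ATOM records in the PDB format.

```python
import sqlite3

# Two hand-written ATOM records in the fixed-column PDB layout (illustrative data).
PDB_LINES = [
    "ATOM      1  N   VAL A   1      11.104   6.134  -6.504",
    "ATOM      2  CA  VAL A   1      11.639   6.071  -5.147",
]


def parse_atom_line(line):
    """Slice one ATOM/HETATM record into a tuple using the PDB column positions."""
    return (
        int(line[6:11]),       # serial: atom identifier (columns 7-11)
        line[12:16].strip(),   # name: atom name (columns 13-16)
        line[17:20].strip(),   # resName: residue name (columns 18-20)
        line[21],              # chainID: chain identifier (column 22)
        int(line[22:26]),      # resSeq: residue identifier (columns 23-26)
        float(line[30:38]),    # x coordinate (columns 31-38)
        float(line[38:46]),    # y coordinate (columns 39-46)
        float(line[46:54]),    # z coordinate (columns 47-54)
    )


def load_atoms(lines):
    """Create an in-memory SQLite table holding one row per atom record."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ATOM (serial INTEGER, name TEXT, resName TEXT, "
        "chainID TEXT, resSeq INTEGER, x REAL, y REAL, z REAL)"
    )
    rows = [
        parse_atom_line(line)
        for line in lines
        if line.startswith(("ATOM", "HETATM"))
    ]
    conn.executemany("INSERT INTO ATOM VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


conn = load_atoms(PDB_LINES)
# Once loaded, any selection is an ordinary SQL query.
print(conn.execute("SELECT name, x, y, z FROM ATOM WHERE chainID = 'A'").fetchall())
```

Once the atoms sit in a table, every selection becomes an ordinary SELECT statement, which is the kind of query that pdb2sql hides behind Python methods.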

Capabilities of pdb2sql
pdb2sql allows a user to query, manipulate, and process PDB files through a series of dedicated classes. We give an overview of these features and illustrate them with snippets of code. More examples can be found in the documentation (https://pdb2sql.readthedocs.io).

Extracting data from PDB files
pdb2sql allows a user to query the database with the get(attr, **kwargs) method. The argument attr is a single column name or several column names of the SQL database (for instance, the comma-separated string 'x,y,z'); see Table 1 for available attributes. The keyword arguments kwargs can then be used to specify a sub-selection of atoms. For example, a single call can extract the coordinates of the carbon and hydrogen atoms that belong to all the valine and leucine residues of the chain labelled A in the PDB file. Atoms can also be excluded from the selection by prepending the prefix no_ to the attribute name. For example, one can extract the atom and residue names of all atoms except those belonging to the glycine and phenylalanine residues of the structure. Similar combinations of arguments can be designed to obtain complex selection rules that precisely select the desired atom properties.
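The selection rules described above map naturally onto SQL WHERE clauses. The sketch below shows one way such a translation could work, using only the standard sqlite3 module and a toy table. The function build_query, the table name ATOM, and the sample rows are hypothetical and written for this illustration; they are not taken from the pdb2sql source code.

```python
import sqlite3


def build_query(attr, **kwargs):
    """Translate keyword selections into a parameterised SELECT statement.

    resName=['VAL', 'LEU'] becomes "resName IN (?, ?)", and the no_ prefix
    negates the condition: no_resName=['GLY'] becomes "resName NOT IN (?)".
    Column names are trusted here; real code should validate them.
    """
    conditions, params = [], []
    for key, value in kwargs.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        negate = key.startswith("no_")
        column = key[3:] if negate else key
        marks = ", ".join("?" for _ in values)
        conditions.append(f"{column} {'NOT IN' if negate else 'IN'} ({marks})")
        params.extend(values)
    sql = f"SELECT {attr} FROM ATOM"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


# Toy table with three atoms (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ATOM (name TEXT, resName TEXT, chainID TEXT, "
    "x REAL, y REAL, z REAL)"
)
conn.executemany(
    "INSERT INTO ATOM VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("CA", "VAL", "A", 1.0, 2.0, 3.0),
        ("CA", "GLY", "A", 4.0, 5.0, 6.0),
        ("CA", "LEU", "B", 7.0, 8.0, 9.0),
    ],
)

# Coordinates of valine and leucine atoms in chain A.
sql, params = build_query("x, y, z", resName=["VAL", "LEU"], chainID="A")
coords = conn.execute(sql, params).fetchall()

# Atom and residue names of everything except glycine and phenylalanine.
sql, params = build_query("name, resName", no_resName=["GLY", "PHE"])
names = conn.execute(sql, params).fetchall()
```

Passing values as query parameters rather than formatting them into the SQL string keeps the sketch safe against malformed input; only the column names are interpolated directly.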

Manipulating PDB files
The data contained in the SQL database can also be modified using the update(attr, vals, **kwargs) method. The attributes and keyword arguments are identical to those in the get method. The vals argument should contain a numpy array whose dimensions match the selection criteria. For example:

    import numpy as np
    from pdb2sql import pdb2sql

    pdb = pdb2sql('1AK4.pdb')
    # Coordinates of the atoms in the first residue of chain A
    xyz = pdb.get('x,y,z', chainID='A', resSeq=1)
    xyz = np.array(xyz)
    # Subtract the per-axis mean so that the centroid lies at the origin
    xyz -= np.mean(xyz, axis=0)
    pdb.update('x,y,z', xyz, chainID='A', resSeq=1)

This snippet first extracts the coordinates of the atoms in the first residue of chain A, then translates this fragment so that its centroid lies at the origin, and finally updates the coordinate values in the database.

pdb2sql also provides a convenient transform module to easily translate or rotate structures. For example, to translate the first residue of chain A by 5 Å along the Y-axis:

    import numpy as np
    from pdb2sql import pdb2sql
    from pdb2sql import transform

    pdb = pdb2sql('1AK4.pdb')
    # Translation vector: 5 Å along the Y-axis
    trans_vec = np.array([0, 5, 0])
    transform.translation(pdb, trans_vec, resSeq=1, chainID='A')

One can also rotate a given selection around a given axis with the rot_axis function:

    # Rotate by π radians (180°) around the X-axis
    angle = np.pi
    axis = (1., 0., 0.)
    transform.rot_axis(pdb, axis, angle, resSeq=1, chainID='A')
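The geometry behind such translations and axis rotations can be sketched without any dependency. The example below uses Rodrigues' rotation formula and assumes that the rotation axis passes through the centroid of the selected atoms; the text above does not state where the axis is anchored, so this is an assumption of the sketch. The function names centroid, translate and rotate_about_axis are hypothetical, and plain Python is used instead of numpy only to keep the example self-contained.

```python
import math


def centroid(points):
    """Return the mean position of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def translate(points, vec):
    """Shift every point by the vector vec."""
    return [tuple(p[i] + vec[i] for i in range(3)) for p in points]


def rotate_about_axis(points, axis, angle):
    """Rotate points by angle (radians) around an axis through their centroid.

    Rodrigues' formula, with k the unit axis and a the angle:
    v' = v cos(a) + (k x v) sin(a) + k (k . v) (1 - cos(a))
    """
    norm = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)
    c, s = math.cos(angle), math.sin(angle)
    cx, cy, cz = centroid(points)
    rotated = []
    for x, y, z in points:
        # Position relative to the centroid
        vx, vy, vz = x - cx, y - cy, z - cz
        dot = kx * vx + ky * vy + kz * vz
        # Cross product k x v
        crx = ky * vz - kz * vy
        cry = kz * vx - kx * vz
        crz = kx * vy - ky * vx
        rotated.append((
            vx * c + crx * s + kx * dot * (1 - c) + cx,
            vy * c + cry * s + ky * dot * (1 - c) + cy,
            vz * c + crz * s + kz * dot * (1 - c) + cz,
        ))
    return rotated


xyz = [(1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
moved = translate(xyz, (0.0, 5.0, 0.0))                    # 5 units along Y
turned = rotate_about_axis(xyz, (1.0, 0.0, 0.0), math.pi)  # half turn about X
```

Because the rotation is taken about the centroid, a half turn of these two points simply exchanges their positions up to floating-point rounding.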

Identifying interface
The interface class is derived from the pdb2sql class and offers functionality to identify contact atoms or residues between two different chains with a given contact distance. It is useful for extracting and analysing the interface of, e.g., protein-protein complexes. The following example snippet returns all the atoms and all the residues of the interface of '1AK4.pdb' defined by a contact distance of 6 Å.

    from pdb2sql import interface

    pdb = interface('1AK4.pdb')
    # Atoms and residues in contact across the interface, using a 6 Å cutoff
    atoms = pdb.get_contact_atoms(cutoff=6.0)
    res = pdb.get_contact_residues(cutoff=6.0)

It is also possible to create an interface instance directly from a pdb2sql instance. In this case, all changes made to the pdb2sql instance before the interface instance is created are carried over to the interface instance; afterwards, the two instances are independent, which means changes in one will not affect the other.
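The contact-distance criterion itself is simple to state in code. The following sketch is a brute-force illustration under stated assumptions, not the algorithm used by the interface class: the function contact_pairs, the atom representation (resSeq, name, coordinates), and the choice of an inclusive cutoff are all hypothetical.

```python
import math


def contact_pairs(chain_a, chain_b, cutoff=6.0):
    """Return index pairs (i, j) of atoms whose distance is at most cutoff.

    Each atom is a tuple (resSeq, name, (x, y, z)) with coordinates in Å.
    All pairs are tested, so the cost is O(len(chain_a) * len(chain_b)).
    """
    pairs = []
    for i, atom_a in enumerate(chain_a):
        for j, atom_b in enumerate(chain_b):
            if math.dist(atom_a[2], atom_b[2]) <= cutoff:
                pairs.append((i, j))
    return pairs


# Illustrative coordinates for two short chains.
chain_a = [
    (1, "CA", (0.0, 0.0, 0.0)),
    (2, "CA", (3.8, 0.0, 0.0)),
    (3, "CA", (20.0, 0.0, 0.0)),
]
chain_b = [
    (10, "CA", (0.0, 5.0, 0.0)),
    (11, "CA", (0.0, 30.0, 0.0)),
]

pairs = contact_pairs(chain_a, chain_b, cutoff=6.0)
# Residues that own at least one contact atom, per chain
residues_a = sorted({chain_a[i][0] for i, _ in pairs})
residues_b = sorted({chain_b[j][0] for _, j in pairs})
```

For large complexes a spatial index such as a grid or k-d tree would avoid testing every pair, but the all-pairs loop keeps the definition of a contact explicit.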

Computing Structure Similarity
The StructureSimilarity class allows a user to compute similarity measures between two protein-protein complexes. Several popular measures used to classify the quality of protein complex structures in the CAPRI (Critical Assessment of PRedicted Interactions) challenges (Méndez, Leplae, Maria, & Wodak, 2003) have been implemented: interface rmsd, ligand rmsd, fraction of native contacts and DockQ (Basu & Wallner, 2016). The approach implemented to compute the interface rmsd and ligand rmsd is identical to that of the well-known package ProFit (Martin & Porter, 2009).

Application
pdb2sql has been used at the Netherlands eScience Center for bioinformatics projects. This is, for example, the case of iScore (Geng et al., 2019), which uses graph kernels and support vector machines to rank protein-protein interfaces. We illustrate the use of the package here by computing the interface rmsd and ligand rmsd of a series of structural models using the experimental structure as a reference. This is a common task in protein-protein docking, where a large number of docked conformations are generated and must then be compared to the ground truth to identify the best generated poses. This calculation is usually done using the ProFit software, and we therefore compare our results with those obtained with ProFit. The code to compute the similarity measures for different decoys is simple.

Note that the method will compute the i-zone, i.e., the zone of the proteins that form the interface, in a similar way to ProFit. This is done for the first calculations and the i-zone is