matchms - processing and similarity evaluation of mass spectrometry data

Mass spectrometry data is at the heart of numerous applications in the biomedical and life sciences. With the growing use of high-throughput techniques, researchers need to analyse larger and more complex datasets. In particular, through joint efforts in the research community, fragmentation mass spectrometry datasets are growing in size and number. Platforms such as MassBank (Horai et al. 2010), GNPS (Wang et al. 2016) or MetaboLights (Haug et al. 2020) serve as open-access hubs for sharing raw, processed, or annotated fragmentation mass spectrometry data (MS/MS). Without suitable tools, however, exploiting such datasets remains overly challenging. In particular, large collected datasets contain data acquired using different instruments and measurement conditions, and can further contain a significant fraction of inconsistent, wrongly labeled, or incorrect metadata (annotations).

Identifiers (USI) (Wang et al. 2020). Further data formats or more extensive options regarding metadata parsing can best be handled by using pyteomics (Levitsky et al. 2019) or pymzml (Kösters et al. 2018). Matchms contains numerous metadata cleaning and harmonizing filter functions that can easily be stacked to construct a desired pipeline (Figure 2), and the pipeline can be extended by custom functions wherever needed. Available filters include extensive cleaning, correcting, and checking of key metadata fields such as compound name, structure annotations (InChI, SMILES, InChIKey), ionmode, adduct, or charge. Many of the provided metadata cleaning filters were designed for handling and improving GNPS-style MGF or JSON datasets. For future versions, however, we aim to extend this to other commonly used public databases.
Figure 1: Flowchart of the matchms workflow. Reference and query spectra are filtered using the same set of filters (here: filter A and filter B). Once filtered, every reference spectrum is compared to every query spectrum using the matchms.Scores object.
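
The idea of stacking metadata filters and adding custom ones can be illustrated with a minimal, library-independent sketch. It does not use the matchms API: the dict-based spectrum, the three filter functions, and the helper apply_filters are all hypothetical names chosen for illustration, and the example metadata is made up. The only convention assumed here is that a filter returns either a cleaned spectrum or None to discard it.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for a spectrum: a plain dict of metadata fields.
# This sketch only mirrors the stacking idea, not the matchms data model.
Spectrum = dict
Filter = Callable[[Spectrum], Optional[Spectrum]]


def clean_compound_name(spectrum: Spectrum) -> Optional[Spectrum]:
    """Strip surrounding whitespace from the compound name, if present."""
    cleaned = dict(spectrum)
    if "compound_name" in cleaned:
        cleaned["compound_name"] = cleaned["compound_name"].strip()
    return cleaned


def harmonize_ionmode(spectrum: Spectrum) -> Optional[Spectrum]:
    """Map common ionmode spellings onto 'positive' or 'negative'."""
    aliases = {"pos": "positive", "+": "positive",
               "neg": "negative", "-": "negative"}
    cleaned = dict(spectrum)
    ionmode = str(cleaned.get("ionmode", "")).strip().lower()
    # Unknown spellings are kept as-is; a missing value becomes "n/a".
    cleaned["ionmode"] = aliases.get(ionmode, ionmode) or "n/a"
    return cleaned


def require_precursor_mz(spectrum: Spectrum) -> Optional[Spectrum]:
    """Custom filter: discard spectra without a precursor m/z."""
    return spectrum if spectrum.get("precursor_mz") is not None else None


def apply_filters(spectrum: Spectrum, filters: List[Filter]) -> Optional[Spectrum]:
    """Apply filters in order; stop early once a filter rejects the spectrum."""
    for filter_function in filters:
        spectrum = filter_function(spectrum)
        if spectrum is None:
            return None
    return spectrum


# A pipeline is simply an ordered list of filter functions.
pipeline = [clean_compound_name, harmonize_ionmode, require_precursor_mz]

# Made-up example metadata.
raw = {"compound_name": "  Caffeine ", "ionmode": "Pos", "precursor_mz": 195.0877}
processed = apply_filters(raw, pipeline)
print(processed)
```

Because every filter has the same signature, a custom function such as require_precursor_mz can be placed anywhere in the list without changing the other filters.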
Current Python tools for working with MS/MS data include pyOpenMS (Röst et al. 2014), a wrapper for OpenMS (Röst et al. 2016) with a strong focus on processing and filtering of raw mass spectral data. pyOpenMS has a wide range of peak processing functions which can be used to complement a matchms filtering pipeline. Another, more lightweight and native Python package with a focus on spectrum visualization is spectrum_utils (Bittremieux 2020). Matchms focuses on comparing and linking large numbers of mass spectra. Many of its built-in filters are aimed at handling large mass spectral datasets from common public data libraries such as GNPS.
Matchms provides functions to derive different similarity scores between spectra. These include the established spectrum-based measures cosine score and modified cosine score (Watrous et al. 2012). The package also offers fast implementations of common similarity measures (Dice, Jaccard, Cosine) that can be used to compute similarity scores between molecular fingerprints (rdkit, morgan1, morgan2, morgan3, all implemented using rdkit (Landrum, n.d.)). Matchms can derive similarity measures between large numbers of spectra at comparably fast speed because its score implementations are based on NumPy (Walt, Colbert, and Varoquaux 2011), SciPy (Virtanen et al. 2020), and Numba (Lam, Pitrou, and Seibert 2015). Additional similarity measures can easily be added using the matchms API. The provided API also allows users to quickly compare, sort, and inspect query versus reference spectra using either the included similarity scores or added custom measures. The API was designed to be easily extensible so that users can add their own filters for spectra processing, or their own similarity functions for spectral comparisons. The present set of filters and similarity functions is mostly geared towards smaller molecules and natural compounds, but it could be extended by functions specific to larger peptides or proteins.
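
To make the cosine score between two spectra more concrete, the following is a minimal pure-Python sketch of a greedy variant. It is not the matchms implementation (which relies on NumPy and Numba for speed). The function name cosine_greedy, the tolerance parameter, and the example peak lists are assumptions made for illustration. Peaks from both spectra are paired when their m/z values agree within the tolerance, pairs are assigned greedily by decreasing intensity product, and the summed products are divided by the product of the two intensity vector norms.

```python
import math


def cosine_greedy(mz_a, int_a, mz_b, int_b, tolerance=0.1):
    """Greedy cosine score between two peak lists (illustrative sketch).

    Returns a tuple (score, number_of_matched_peaks).
    """
    # Collect all candidate peak pairs whose m/z values agree within tolerance.
    candidates = []
    for i, (mz1, inten1) in enumerate(zip(mz_a, int_a)):
        for j, (mz2, inten2) in enumerate(zip(mz_b, int_b)):
            if abs(mz1 - mz2) <= tolerance:
                candidates.append((inten1 * inten2, i, j))

    # Greedy assignment: take the largest intensity products first
    # and use every peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    dot_product = 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            dot_product += product
            used_a.add(i)
            used_b.add(j)

    # Normalize by the Euclidean norms of both intensity vectors.
    norm = (math.sqrt(sum(x * x for x in int_a))
            * math.sqrt(sum(x * x for x in int_b)))
    score = dot_product / norm if norm > 0 else 0.0
    return score, len(used_a)


# Made-up peak lists (m/z values and normalized intensities).
mz_a, int_a = [100.0, 150.0, 200.0], [0.7, 0.2, 1.0]
mz_b, int_b = [100.05, 140.0, 200.0], [0.4, 0.2, 1.0]

score, n_matching = cosine_greedy(mz_a, int_a, mz_b, int_b)
print(f"score={score:.4f}, matching peaks={n_matching}")
```

A modified cosine score in the sense of Watrous et al. (2012) would additionally accept peak pairs whose m/z difference matches the precursor mass difference of the two spectra; that extension is left out of this sketch.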
Matchms is freely accessible either as a conda package (https://anaconda.org/nlesc/matchms), or as source code on GitHub (https://github.com/matchms/matchms). For further code examples and documentation see https://matchms.readthedocs.io/en/latest/. All main functions are covered by tests and continuous integration to offer reliable functionality. We explicitly welcome future contributions from the mass spectrometry community and hope that matchms can serve as a reliable and accessible entry point for handling complex mass spectrometry datasets using Python.

Example workflow
A typical workflow with matchms looks as indicated in Figure 1, or as described in the following code example.

    from matchms.importing import load_from_mgf
    from matchms.filtering import default_filters
    from matchms.filtering import normalize_intensities
    from matchms import calculate_scores
    from matchms.similarity import CosineGreedy

    # Read spectrums from a MGF formatted file
    file = load_from_mgf("all_your_spectrums.mgf")

    # Apply filters to clean and enhance each spectrum
    spectrums = []
    for spectrum in file:
        spectrum = default_filters(spectrum)
        spectrum = normalize_intensities(spectrum)
        spectrums.append(spectrum)

    # Calculate Cosine similarity scores between all spectrums
    scores = calculate_scores(references=spectrums,
                              queries=spectrums,
                              similarity_function=CosineGreedy())

    # Print the calculated scores for each spectrum pair
    for score in scores:
        (reference, query, score, n_matching) = score
        # Ignore scores between same spectrum and
        # pairs which have less than 20 peaks in common
        if reference is not query and n_matching >= 20:
            print (

Processing spectrum peaks and plotting
Matchms provides numerous filters to process mass spectral peaks. Below is a simple example that removes low-intensity peaks from a spectrum (Figure 3).