dms2dfe: Comprehensive Workflow for Analysis of Deep Mutational Scanning Data

High throughput genotype to phenotype (G2P) data is increasingly being generated by widely applicable Deep Mutational Scanning (DMS) method. dms2dfe is a comprehensive end-to-end workflow that addresses critical issue with noise reduction and offers variety of crucial downstream analyses. Noise reduction is carried out by normalizing counts of mutants by depth of sequencing and subsequent dispersion shrinkage at the level of calculation of preferential enrichments. In downstream analyses, dms2dfe workflow provides identification of relative selection pressures, potential molecular constraints and generation of data-rich visualizations. Availability dms2dfe is implemented as a python package and it is available at https://kc-lab.github.io/dms2dfe. Contact kausik@igib.in, rohan@igib.in Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Recent developments in high throughput mutagenesis and massively parallel DNA sequencing culminated in a method popularly known as Deep Mutational Scanning (DMS) (1,2). The broad appeal of DMS method is evident by the wide range of applicability (3) relevant to structural modeling (4), protein stability (5), substrate specificity (6)(7)(8), protein-protein interactions (9)(10)(11)(12)(13)(14), mutational effects on organismal fitness (15)(16)(17), environmental effects (18), biology of viruses (19)(20)(21), fitness landscapes of antibiotic resistance (7,22) and proto-oncogenes (23). The distribution of fitness scores along fitness axis of fitness landscape i.e. distribution of fitness effects (DFE) is known as a powerful estimator of the underlying evolutionary dynamics and has been traditionally used to contextualize molecular evolution (24,25). One can characterize the nature of selection pressure in effect and potential susceptibility, evolvability or robustness of the individuals in the population by comparing DFEs.
Accounting and reducing for the noise is a major primary challenge in the analysis of differential count data. Adopting an alternative approach to the analysis than available tools (26-28), dms2dfe implements noise reduction utilizing both sequencing depth (Fig S1 A) and empirical Bayes shrinkage (29) (Fig S1 B). Along with upstream analysis to obtain variant counts, preferential enrichments and fitness estimates, dms2dfe offers utilities for downstream analysis such as identification of relative selection pressure, molecular constraints and visualizations of the data; thus constituting an end-to-end workflow. Comprehensively documented python programming library (API) (available at http://kc-lab.github.io/dms2dfe) provides a platform for further developments to adapt with advances and modifications in DMS experiments.

dms2dfe workflow
In a conventional DMS experiment, a pool of mutants is selected in a co-culture competition assay under selection pressure of interest and frequencies of the mutants are compared with a pool or mutants used as a reference (also referred as input or background). dms2dfe is applicable to all such approaches wherein a basic DMS experiment involves a comparative analysis of selected and reference pool of mutants. In addition to the shot gun and full length ultra-deep sequencing, dms2dfe also supports concatamer based approach (7) as well as multiplexing strategy using barcoded amplicons (30). Sequencing data whether aligned (.sam, .bam) or not (.fastq) can be provided as input of dms2dfe workflow. Alternatively, mutational data (frequencies, preferential enrichments or fitness score) can be provided to exclusively carry out downstream analyses.
As shown in Fig 1 A, in the comprehensive workflow of dms2dfe, following key aspects of the analysis of DMS experiment are addressed.

identification of relative selection pressure
As previously implemented (8), we classify the statistically neutral (least effect) mutants upper and lower thresholds from the preferential enrichments of reference (unselected) pool.
Depending on the nature of experiment, fitness scores can be estimated by one of several methods based on preferential enrichments of wild type alleles or otherwise preferential enrichments themselves can be used as fitness scores (see S1 Supporting Methods). So, mutants can be classified as beneficial, deleterious or neutral as follows, where, M i is i th kind of survived mutant and F i is its respective fitness score. Note that class neutral exclusively represents the synonymous mutants. Metrics such as relative number of mutations under the classes of fitness (Fig 1B), relative survivability i.e. difference in number of mutants survived in test condition with respect to control condition (∆n) and relative fitness changes i.e. difference in average fitness of mutants in test condition with respect to control condition (∆F) assist users to identify the directionality and strength of the relative selection pressures.

Identification of molecular constraints
Due to multiple reasons (31), effects of mutation at one site are dependent on residues at other sites. This results in a highly interdependent ensemble of molecular constraints which is difficult to decipher (32). Here we use ensemble machine learning algorithm -Gradient Boosting which allows modeling of the complex mutational data. It provides critical advantage of determination by relative importances of the features (Fig 1 C) and partial dependence i.e. marginal effect of features on the target i.e. fitness scores (Fig S2). Such information would help users in contextualizing G2P interactions in terms of how fitness scores are mechanistically constrained by molecular constraints (Fig S1 D).

Visualizations
Fitness data too can be visualized in three different formats. Fitness per individual mutations is visualized in the form of a heatmap also called sequence-function map, wherein each locus represents fitness of individual mutant (Fig 1 E). Fitness per substitutions too can be visualized in the form of matrix of average fitness scores per individual substitution (Fig 1 F). Average fitness values per position are projected onto the PDB structure by utilizing visualization modules of UCSF-Chimera (34) (Fig 1 G). Alongside, an integrated visualization of the change in frequencies of mutants at the level of individual mutations, substitutions and positions is generated (Fig S1 B).