mutyper: assigning and summarizing mutation types for analyzing germline mutation spectra

1 Department of Electrical Engineering & Computer Sciences, University of California, Berkeley, CA, United States of America
2 Department of Bioengineering, University of Washington, Seattle, WA, United States of America
3 Department of Genome Sciences, University of Washington, Seattle, WA, United States of America
4 Departments of Human Genetics and of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States of America
5 The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, United Kingdom
¶ Corresponding author
DOI: 10.21105/joss.05227


Summary
The germline mutation process drives genetic variation and provides the raw material for adaptive evolution. Germline mutations arise from spontaneous DNA damage caused by environmental mutagens, or errors in DNA replication. Populations and species may experience distinct mutational histories due to variation in environmental exposure, life history, and heritable variation in the machinery controlling DNA replication fidelity.
Mutational mechanisms often have mutation signatures in terms of the nucleotide sequence contexts where they act. Population genomics has given increasing attention to nucleotide sequence context in the study of the germline mutation process (reviewed in Carlson et al. (2020)). Single-nucleotide polymorphisms (SNPs) can be assigned to mutation types by the ancestral and derived nucleotide states and a window of local nucleotide context in the ancestral background. The mutation spectrum of an individual or population is the relative distribution of these mutation types.
Inter- and intra-specific germline mutation spectrum variation has revealed a dynamic and evolving germline mutation process shaping modern genomic diversity. Parsing mutation spectra temporally (via allele frequency) and spatially (via genomic annotations) has revealed the history and present of mutational processes, and applying such analysis to de novo mutation data may be clinically informative for rare or undiagnosed genetic diseases.
Here we describe mutyper, a command-line utility and Python package that assigns ancestrally polarized mutation types to SNP data, computes mutation spectra for individuals and populations, and computes sample frequency spectra stratified by mutation type for population genetic inference. Documentation is provided at https://harrispopgen.github.io/mutyper; source code is available at https://github.com/harrispopgen/mutyper.

Statement of need
Despite many exciting findings in this growing area, there is a lack of software for germline mutation type annotation and spectrum generation from population-scale genomic data. We developed mutyper, an open-source command-line utility and Python package, to address the field's need for efficient and well-tested software for both larger bioinformatics pipelines and exploratory analysis.
The literature on cancer somatic mutation signatures includes several software tools for clustering and dimensionality reduction that are either not scalable or not flexible enough for general population-scale germline variation data (Gehring et al., 2015; Goncearenco et al., 2017; Lee et al., 2018; S. Li et al., 2020; Manders et al., 2022; Rosales et al., 2017; Rosenthal et al., 2016), but the package helmsman (Carlson et al., 2018) enables partial interoperability with some of these tools. Complementing this work, mutyper is a flexible, efficient, and extensible software package for low-level bioinformatic workflows in germline mutation spectrum studies.

Implementation

CLI
The core functionality of the mutyper command-line interface (CLI) is to augment SNP data (read from a file or piped in VCF/BCF format) with ancestral mutation type annotations and stream the result to stdout. Fast and memory-efficient processing of VCF input (Danecek et al., 2011) is achieved with cyvcf2 (Pedersen & Quinlan, 2017). Mutation types are recorded in the INFO field of each variant as a key-value pair such as mutation_type=GAG>GTG. Reference and alternative alleles are polarized to the ancestral and derived states, respectively, and genotype counts are updated accordingly. The mutyper CLI is fully compatible with standard CLIs (e.g. bcftools (H. Li, 2011)) for filtering SNPs or samples, masking regions, and merging/concatenating VCFs.
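As a rough illustration of the annotation format only, the following sketch extracts the mutation_type value from a raw VCF INFO string. The helper name parse_mutation_type is hypothetical and is not part of mutyper; only the mutation_type key and the GAG>GTG example are taken from the description above.

```python
def parse_mutation_type(info):
    """Return (ancestral_kmer, derived_kmer) from a raw INFO string, or None.

    Hypothetical helper for illustration; not a mutyper function.
    """
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        if key == "mutation_type" and sep:
            ancestral, arrow, derived = value.partition(">")
            # Both k-mers must be present and of equal length.
            if arrow and ancestral and len(ancestral) == len(derived):
                return ancestral, derived
    return None


example = parse_mutation_type("AC=3;AN=10;mutation_type=GAG>GTG")
```

In practice a VCF library such as cyvcf2 would handle INFO parsing; the plain string handling here just makes the key-value layout explicit.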
To polarize ancestral and derived allelic states and to define ancestral k-mer backgrounds, an ancestral genome in FASTA format is required. mutyper uses the package pyfaidx (Shirley et al., 2015) for fast random access to ancestral genomic content with minimal memory requirements. Ancestral genomes can be specified in several ways. For example, the ancestral FASTA sequence provided by the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015) was estimated from a multi-species alignment using ortheus (Paten et al., 2008); in such a case, the ancestral FASTA can be passed to mutyper directly. Alternatively, mutyper can estimate ancestral states by polarizing SNPs using an outgroup genome aligned to the reference (e.g. the chimp genome liftover to the human reference genome).
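The polarization step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not mutyper's implementation: the function polarize and its arguments are hypothetical, the ancestral sequence is a plain string indexed from 0, and any site whose context contains a character other than uppercase A, C, G, or T, or whose ancestral base matches neither allele, is skipped.

```python
def polarize(ancestral_seq, pos, ref, alt, alt_count, n_haplotypes, k=3):
    """Return (mutation_type, derived_count), or None if the site is skipped.

    Hypothetical sketch: pos is a 0-based index into ancestral_seq, and
    alt_count is the number of sampled haplotypes carrying the ALT allele.
    """
    if k % 2 != 1:
        raise ValueError("k must be odd so the SNP sits at the center")
    flank = k // 2
    if pos - flank < 0 or pos + flank >= len(ancestral_seq):
        return None  # context window runs off the sequence
    context = ancestral_seq[pos - flank:pos + flank + 1]
    if any(base not in "ACGT" for base in context):
        return None  # assumption: treat anything else as missing
    ancestral = context[flank]
    if ancestral == ref:
        derived, derived_count = alt, alt_count
    elif ancestral == alt:
        # REF is the derived state: swap alleles and flip the count.
        derived, derived_count = ref, n_haplotypes - alt_count
    else:
        return None  # ancestral base matches neither allele
    derived_kmer = context[:flank] + derived + context[flank + 1:]
    return f"{context}>{derived_kmer}", derived_count


seq = "TTGAGCC"
same = polarize(seq, 3, "A", "T", alt_count=3, n_haplotypes=10)
flipped = polarize(seq, 3, "T", "A", alt_count=3, n_haplotypes=10)
```

Here same and flipped both describe a GAG>GTG mutation, but in the second case the derived allele count becomes 10 - 3 = 7 because the reference allele is the derived state.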
The user may specify the desired k-mer context size (e.g. k = 3 for triplet mutation types). As in previous work, mutation type annotations are, by default, collapsed by reverse complementation such that the ancestral state is either A or C. Alternatively, a BED file can be supplied to define the strand orientation for nucleotide context at each site (e.g. according to direction of replication or transcription).
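Collapsing by reverse complementation can be illustrated with a short sketch. The function names are hypothetical and this is not mutyper's code; it only assumes mutation types written as ANCESTRAL>DERIVED with an odd k and uppercase A, C, G, T.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of an uppercase ACGT string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def collapse(mutation_type):
    """Rewrite a mutation type so its ancestral center base is A or C."""
    ancestral, derived = mutation_type.split(">")
    center = len(ancestral) // 2
    if ancestral[center] in "AC":
        return mutation_type
    return f"{reverse_complement(ancestral)}>{reverse_complement(derived)}"


# Enumerate every triplet mutation type and count the collapsed classes.
all_types = set()
for left, anc, right in product("ACGT", repeat=3):
    for der in "ACGT":
        if der != anc:
            all_types.add(f"{left}{anc}{right}>{left}{der}{right}")
collapsed_types = {collapse(t) for t in all_types}
```

For k = 3 the 192 strand-specific mutation types collapse to 96 classes, which is why triplet spectra are usually reported with 96 entries.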
In addition to this core functionality, the CLI includes several other subcommands that summarize mutation-type-annotated SNP data piped from the core command described above. Individual- and population-level mutation spectra and sample frequency spectra are streamed to stdout in tab-separated form, and can be used to characterize modern mutation spectrum variation and infer its evolutionary history.
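The two kinds of summary can be sketched as simple counting operations. Everything below is an illustrative assumption rather than mutyper's behavior: the input layout (a list of mutation types paired with per-sample derived allele counts), the function names, and the choice to weight individual spectra by derived allele copies.

```python
from collections import Counter, defaultdict


def individual_spectra(variants):
    """Per-sample counts of derived allele copies for each mutation type."""
    spectra = defaultdict(Counter)
    for mutation_type, genotypes in variants:
        for sample, derived_copies in genotypes.items():
            if derived_copies > 0:
                spectra[sample][mutation_type] += derived_copies
    return spectra


def stratified_sfs(variants):
    """Sample frequency spectrum for each mutation type.

    sfs[mutation_type][i] is the number of variants of that type whose
    derived allele appears i times in the whole sample.
    """
    sfs = defaultdict(Counter)
    for mutation_type, genotypes in variants:
        frequency = sum(genotypes.values())
        if frequency > 0:
            sfs[mutation_type][frequency] += 1
    return sfs


def relative(spectrum):
    """Normalize counts to a relative distribution over mutation types."""
    total = sum(spectrum.values())
    return {mtype: count / total for mtype, count in spectrum.items()}


# Toy data: (mutation type, derived allele copies carried by each sample).
variants = [
    ("GAG>GTG", {"s1": 1, "s2": 0}),
    ("GAG>GTG", {"s1": 2, "s2": 1}),
    ("ACT>ATT", {"s1": 0, "s2": 1}),
]
spectra = individual_spectra(variants)
sfs = stratified_sfs(variants)
```

With this toy input, sample s2 carries one copy each of two mutation types, so relative(spectra["s2"]) assigns 0.5 to each.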

Python API
The mutyper Python API exposes the functionality described above for use in interactive notebook sessions, so that custom analyses of mutation type data can draw on the strong ecosystem of scientific computing packages available in Python. For example, dimensionality reduction (such as principal components analysis or non-negative matrix factorization) is often used to summarize mutation spectra, and the scikit-learn package (Pedregosa et al., 2011) can be used in conjunction with the mutyper API for this purpose. The mutyper API produces mutation spectra or sample frequency spectrum matrices as pandas data frames (McKinney, 2010), which can be easily manipulated, visualized, and analyzed with standard Python scientific computing packages.

Applications
mutyper was first used by DeWitt et al. (2021) alongside the Python package mushi to infer mutation rate histories from mutation spectra using coalescent theory. Sasani et al. (2022) used mutyper in work reporting the discovery of a mutator allele in a unique mouse model system. Vollger et al. (2022) used mutyper to analyze long-read sequencing data from humans, finding elevated mutation rates and distinct mutation spectra in segmentally duplicated regions. As of this writing, mutyper is being used in several ongoing studies in multiple labs.