Parent-map: analysis of parental contributions to evolved or engineered protein or DNA sequences

Parent-map analyzes protein or DNA sequences that are derived from one or more parent sequences, and shows parental contributions as well as differences from the relevant parents. Originally developed to analyze capsid protein sequences obtained by directed evolution, parent-map can be used in any case where variant sequences are to be compared to the parent sequences from which they are derived. Parent-map detects sequence shuffling as well as substitutions, insertions and deletions, and displays results in user-friendly formats. Parent-map is an open-source, platform-independent Python 3 script, available as a Bioconda package as well as a Windows program.


Statement of need
Adeno-associated virus (AAV) capsid directed evolution projects typically generate multiple enriched variant sequences after 2 to 5 rounds of selection starting from complex capsid libraries. For libraries developed from a single parental serotype, for example through random peptide insertion at a specific position or surface loop diversification in well-defined variable regions, a single multiple alignment of all enriched variant sequences against the parent sequence conveniently shows how each variant differs from the parent. However, when more than one parental sequence is involved, such as when different libraries are mixed together, or when a library design involves DNA shuffling from several parents, such alignments can quickly become illegible, particularly when the complete capsid gene is sequenced. In such cases, in the absence of appropriate software tools, each variant needs to be separately aligned against all possible parents, a time-consuming and cumbersome process. An added difficulty in the case of shuffled libraries is that, because of high sequence homology between parents, multiple regions will share sequence identity with more than one parent, complicating attempts at comprehensively defining the variant sequences in terms of parental contributions.
To date, SALANTO (Herrmann et al., 2019) appears to be the only relevant publicly available software. However, it only applies to shuffled libraries, and its user-friendliness is limited, as it requires the user to perform a multiple sequence alignment beforehand and to further process the data manually after analysis. The software described in this article, parent-map, provides a user-friendly and comprehensive solution. It can be used with sequences derived from any type of library, or even with naturally occurring mutants or rationally engineered variants. It is not limited to protein sequences. It only requires one file containing the variant sequences to be analyzed and one file containing the parental sequences, without any prior manipulation. It generates a set of five files covering most end-users' needs, in directly usable formats. Finally, although it was developed to address a need in the field of AAV capsid directed evolution, parent-map can be used whenever protein or DNA sequences, whether originating from natural evolution, directed evolution or rational design, are to be compared with one or more possible parental sequences.

Methods
Parent-map was written under Python 3.7 as both a command-line interface (CLI) and a graphical user interface (GUI) application, by allowing the parser modules argparse and Gooey to coexist within a single file (the GUI starts if no argument is present, while any argument causes parent-map to start in CLI mode). A parent-map Python package was created and uploaded to the Python Package Index (PyPI) according to packaging instructions. A parent-map Bioconda (Grüning et al., 2018) recipe based on the PyPI package was written and submitted according to instructions. A stand-alone Windows executable and its installation program were created using PyInstaller and Inno Setup, respectively. The documentation was written using Sphinx.
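The coexistence of argparse and Gooey described above can be illustrated with a minimal sketch. This is not parent-map's actual source: the argument names (variants, parents) and the helper functions are hypothetical, and the GUI branch assumes the third-party Gooey package is installed. It only shows one plausible way to start a GUI when no argument is given and to fall back to plain argparse otherwise.

```python
import argparse
import sys


def build_parser():
    """Build a plain argparse parser; the argument names are hypothetical."""
    parser = argparse.ArgumentParser(description="Toy dual-mode CLI/GUI tool")
    parser.add_argument("variants", help="file with variant sequences")
    parser.add_argument("parents", help="file with parental sequences")
    return parser


def select_mode(argv):
    """Return 'gui' when nothing follows the program name, else 'cli'."""
    return "cli" if len(argv) > 1 else "gui"


def main(argv=None):
    argv = sys.argv if argv is None else argv
    if select_mode(argv) == "gui":
        # Gooey is imported lazily so that CLI use does not depend on it.
        from gooey import Gooey

        @Gooey
        def gui_main():
            # Inside a Gooey-decorated function, parse_args() opens a form.
            return build_parser().parse_args()

        return gui_main()
    # Any argument at all selects the ordinary command-line path.
    return build_parser().parse_args(argv[1:])
```

Keeping the mode decision in a small pure function makes it easy to test without opening a window, and the lazy import means a pipeline installation does not need any GUI dependency.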

Implementation
Parent-map is a platform-independent Python script that generates a set of five output files from two input files. Input file names and options can be entered as arguments at launch time, in which case parent-map runs in CLI mode, or within the GUI, which starts if parent-map is launched without arguments. This flexibility allows parent-map to be deployed in a variety of settings, from a simple desktop application to a component of a bioinformatics pipeline. The first input file contains the variant sequences, typically the most frequent or the most enriched sequences obtained at the completion of a directed evolution experiment. The second input file contains the potential parental sequences of those variants. The most useful files generated by parent-map, particularly in the case of variants derived from DNA shuffling, are the parental contribution maps (file names ending in -par.txt and -par.html, the latter being a colorized version of the former). Instead of all possible combinations, the simplest map that can accurately describe the variant is shown, using as few parents and as few fragments as possible. Other output files include a statistics file summarizing the main features of the variant sequences, a sequence definition file comprehensively defining each variant in terms of its parents, and an alignment file showing how variants differ from their common parent.
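The idea of describing a variant with as few fragments as possible can be sketched as follows. This is an illustration under strong simplifying assumptions, not parent-map's algorithm: the function name is hypothetical, all sequences are assumed to be pre-aligned and of equal length, and insertions and deletions are ignored. Under those assumptions, greedily extending each fragment as far as any single parent allows yields a cover with the fewest fragments; it does not by itself guarantee the fewest distinct parents.

```python
def minimal_parent_map(variant, parents):
    """Cover `variant` with the fewest fragments, each identical to one parent.

    Illustrative assumptions: sequences are pre-aligned, equal in length,
    and free of insertions or deletions.
    Returns (start, end, parent_name) tuples with 0-based start and
    exclusive end; a residue matching no parent gets parent_name None.
    """
    fragments = []
    pos = 0
    n = len(variant)
    while pos < n:
        best_name, best_end = None, pos
        for name, seq in parents.items():
            end = pos
            # Extend the match with this parent as far as it goes.
            while end < n and seq[end] == variant[end]:
                end += 1
            if end > best_end:
                best_name, best_end = name, end
        if best_name is None:
            best_end = pos + 1  # unmatched residue, reported on its own
        fragments.append((pos, best_end, best_name))
        pos = best_end
    return fragments


# Toy sequences, not real capsid data.
toy_parents = {"P1": "MKTAYIAK", "P2": "MRSGYLAK"}
print(minimal_parent_map("MKTAYLAK", toy_parents))
# [(0, 5, 'P1'), (5, 8, 'P2')]
```

When two parents allow the same extension, this sketch keeps the first one encountered; a fuller implementation would need an explicit rule for such regions of shared identity.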
Parent-map can be tested using the provided variant and parent sample files, based on available literature describing evolved and rationally designed AAV capsid variants. Variants AAV-DJ (Grimm et al., 2008), AAV2.5T (Excoffon et al., 2009), NP84 (Paulk et al., 2018) and OLIG001 (Powell et al., 2016) are derived from shuffled DNA libraries. Variants AAV-F (Hanlon et al., 2019), AAV-PHP.B (Deverman et al., 2016), 7m8 (Dalkara et al., 2013) and rAAV2-retro (Tervo et al., 2016) are derived from peptide insertion libraries. Variants SCH2, SCH9 (Ojala et al., 2018), LI-A and LI-C (Marsic et al., 2014) are derived from more complex rationally designed libraries. Variants AAV2i8 (Asokan et al., 2010) and AAV2-sept-Y-F (Petrs-Silva et al., 2011) were rationally designed.
Using default settings, parent-map correctly identifies single parental contributions from AAV9 for variants AAV-F and AAV-PHP.B, single parental contributions from AAV2 for variants 7m8, rAAV2-retro, LI-A, LI-C and AAV2-sept-Y-F, and multiple parental contributions from AAV2, AAV8 and AAV9 for AAV-DJ, from AAV2 and AAV5 for AAV2.5T, from AAV2, AAV3B and AAV6 for NP84, from AAV2, AAV6, AAV8 and AAV9 for OLIG001, SCH2 and SCH9, and from AAV2 and AAV8 for AAV2i8. Parent-map also correctly detects peptide insertions FVVGQSY for AAV-F and TLAVPFK for AAV-PHP.B, both at position 588, and peptide insertions LALGETTRPA for 7m8 and LADQDYTKTA for rAAV2-retro, both at position 587. Finally, parent-map correctly identifies substitutions A to T at position 457 for AAV-DJ and at position 582 for AAV2.5T, substitutions K to E at 532 and R to G at 585 for NP84, substitution E to K at 532 and an unmatched H at 726 for OLIG001, substitutions I to T at 240 and V to I at 718 for 7m8, substitutions N to D at 382 and V to I at 718 for rAAV2-retro, the 14 and 4 substitutions for LI-A and LI-C respectively, as well as the 7 Y to F substitutions at 252, 272, 444, 500, 700, 704 and 730 for AAV2-sept-Y-F.
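The kind of substitution and insertion report described above can be approximated with Python's standard difflib module. This is only a sketch: the function name is hypothetical, difflib's matcher is a generic text comparison rather than a biological alignment, and the position convention used here (1-based parent position where the change starts, which for an insertion is the parent residue that follows it) is an assumption and may differ from the numbering parent-map reports.

```python
from difflib import SequenceMatcher


def describe_differences(parent, variant):
    """List non-identical regions between a single parent and a variant.

    Each entry is (kind, position, parent_segment, variant_segment), where
    kind is 'replace', 'insert' or 'delete' and position is the 1-based
    parent index at which the change starts.
    """
    # autojunk=False disables difflib's heuristic for long repetitive input.
    matcher = SequenceMatcher(None, parent, variant, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        differences.append((tag, i1 + 1, parent[i1:i2], variant[j1:j2]))
    return differences


# Toy sequences, not real capsid data.
print(describe_differences("MKTAYIAK", "MKTAYLAK"))
# [('replace', 6, 'I', 'L')]
print(describe_differences("MKTAYIAK", "MKTAGGGYIAK"))
# [('insert', 5, '', 'GGG')]
```

For real capsid sequences, a dedicated pairwise aligner with a substitution matrix would be a more appropriate choice than difflib; the sketch is only meant to show the shape of the report.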
A comprehensive description of parent-map is provided in the documentation.