SeroTools: a Python package for Salmonella serotype data analysis

Summary
Subtyping, the ability to differentiate and characterize closely related microorganisms, has historically been a critical component of successful outbreak identification and traceback efforts employed by public health researchers and regulatory agencies for foodborne pathogens. Serological subtyping (or serotyping) has been the standard approach, largely based on antibody binding to surface antigens (Henriksen, 1978). The identification of specific antigenic factors has facilitated the creation of serotyping schemes, which define each serovar using a specific (generally unique) combination of antigenic factors. Serotyping schemes have been developed to assist in characterization of many microorganisms, including pathogens such as Salmonella, E. coli, Shigella (Strockbine, Bopp, Fields, Kaper, & Nataro, 2015), Streptococcus (Spellerberg & Brandt, 2015), and H. influenzae (Ledeboer & Doern, 2015).
The White-Kauffmann-Le Minor (WKL) scheme currently recognizes two species of Salmonella, S. enterica and S. bongori. S. enterica comprises six subspecies (subsp.): enterica (I), salamae (II), arizonae (IIIa), diarizonae (IIIb), houtenae (IV) and indica (VI). Note that S. bongori is still frequently designated as subsp. V for scheme consistency, although it is no longer considered a subspecies of S. enterica. The WKL scheme assigns a unique name (e.g. serovar Enteritidis) to each of the serovars of S. enterica subsp. enterica (I), while the serovars of the other subspecies are referred to by their antigenic formulae. The format of an antigenic formula is defined by the WKL scheme and is demonstrated for serovar Agona in Figure 1. The formula contains a subspecies designation and a colon-separated list of antigenic factors, in which the following fields are required: O antigen, phase 1 H antigen, and phase 2 H antigen. The 'Other H' antigen field includes R phases and third phases and is present only when populated. An antigenic formula may include additional annotation such as:

1. Square brackets to indicate optional factors, (e.

2. Underlining to indicate O factors present only in the presence of the converting phage, represented here and in SeroTools as optional (with square brackets) due to the inability to capture typographical formatting in plain text (e.g. I [1],9,12:e,h:1,5).
These additional annotations are captured in the SeroTools repository and employed for determination of congruence between serovars.
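The formula layout described above can be illustrated with a minimal parsing sketch. This is not the SeroTools API: the names `AntigenicFormula`, `parse_formula`, and `split_factors` are hypothetical, and the sketch assumes that the subspecies designation (when present) is separated from the factors by a single space and that only square-bracket annotation occurs.

```python
import re
from typing import NamedTuple, Optional, Set, Tuple

# Subspecies designations listed in the text (V is the legacy label for S. bongori).
SUBSPECIES = {"I", "II", "IIIa", "IIIb", "IV", "V", "VI"}

# Matches either a bracketed group such as "[1]" or "[1,2]", or a plain factor.
_TOKEN = re.compile(r"\[([^\]]*)\]|([^,\[\]]+)")


class AntigenicFormula(NamedTuple):
    """Hypothetical container for the fields of an antigenic formula."""
    subspecies: Optional[str]
    o: str
    h1: str
    h2: str
    other_h: Optional[str]


def parse_formula(text: str) -> AntigenicFormula:
    """Split a formula into subspecies, O, phase 1 H, phase 2 H, and Other H."""
    text = text.strip()
    subspecies = None
    head, _, rest = text.partition(" ")
    if head in SUBSPECIES:
        subspecies, text = head, rest.strip()
    fields = text.split(":")
    # Three fields are required; a fourth (Other H) appears only when populated.
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 colon-separated fields, got {len(fields)}")
    other_h = fields[3] if len(fields) == 4 else None
    return AntigenicFormula(subspecies, fields[0], fields[1], fields[2], other_h)


def split_factors(field: str) -> Tuple[Set[str], Set[str]]:
    """Return (required, optional) factor sets for one antigen field."""
    required, optional = set(), set()
    for bracketed, plain in _TOKEN.findall(field):
        if bracketed:
            optional.update(f.strip() for f in bracketed.split(",") if f.strip())
        elif plain.strip() not in ("", "-"):  # "-" denotes an empty field
            required.add(plain.strip())
    return required, optional
```

Applied to the example formula `I [1],9,12:e,h:1,5`, `parse_formula` separates the subspecies from the three required fields, and `split_factors` on the O field places factor 1 in the optional set and factors 9 and 12 in the required set.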

Statement of Need
SeroTools addresses multiple critical needs for the efficient analysis of Salmonella serotyping data within the public health community. In recent years, significant technological advances have resulted in a wide range of molecular-based subtyping options, including highly sensitive approaches based on whole genome sequencing (WGS). One such approach involves the application of software tools to WGS data for in silico serovar prediction (Zhang et al., 2019, 2015). In light of the growing interest in in silico serovar prediction and serotyping method-comparison studies, SeroTools provides unique tools that fill multiple gaps in the analysis process. It serves as the only multiformat WKL repository accessible for software development. Currently the WKL scheme is available only as a PDF document (Grimont & Weill, 2007) and as Python lists in SeqSero (Zhang et al., 2015) and SeqSero2 (Zhang et al., 2019). SeroTools also provides the only existing tools for querying the WKL scheme, comparing serovars for congruence, and predicting the most abundant serovar for clusters of isolates.

Functionality and Features
The SeroTools Python package provides the following functionality:

2. Toolkit
   • query - SeroTools provides the ability to easily query the WKL repository with serovar names or antigenic formulas.
   • compare - SeroTools provides a convenient method for automated comparison of serovar designations, including increased differentiation for levels of congruence.
   • cluster - SeroTools includes methods for robust determination of the most abundant serovar for a cluster of isolates.

• SeroTools includes Pythonic data structures and a host of utility functions for analyzing and manipulating large Salmonella serovar datasets. Other functionality includes the ability to determine the antigenic factors common to a group of serovars.
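To make the cluster and common-factor ideas concrete, the following is a minimal sketch rather than the SeroTools implementation. The function names `most_abundant_serovar` and `common_factors` are hypothetical, and the sketch uses a plain frequency count; it does not attempt the more robust handling that the package describes (for example, it does not merge congruent designations before counting).

```python
from collections import Counter


def most_abundant_serovar(cluster):
    """Return (serovar, fraction) for the most frequent non-empty designation.

    Ties are resolved in favour of the designation seen first.
    """
    names = [name.strip() for name in cluster if name and name.strip()]
    if not names:
        return None, 0.0
    serovar, count = Counter(names).most_common(1)[0]
    return serovar, count / len(names)


def common_factors(factor_sets):
    """Return the antigenic factors shared by every serovar in a group."""
    sets = [set(s) for s in factor_sets]
    return set.intersection(*sets) if sets else set()
```

Here a cluster is assumed to be an iterable of serovar designation strings, with empty or missing entries ignored, and each member of `factor_sets` is assumed to be the set of factors for the same antigen field in one serovar.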
SeroTools defines four levels of congruence for use in querying the repository and comparing serovars. Note: optional factors, as referenced below, include optional, exclusive, and weakly agglutinable factors, as specified in the WKL scheme.

1. Exact matches must meet one of the following criteria:
   • The serovar designations are identical strings.
   • The subspecies designations are identical and neither serovar designation includes any antigenic factors.

2. Congruent matches must meet all of the following criteria:
   • The subspecies field must be present either for both serovars or for neither.
   • All required antigenic factors match.
   • Any differences are due to the presence/absence of optional factors.

3. Minimally congruent matches must meet the following criteria:
   • Every antigen of at least one serovar can be considered a formal subset of the corresponding antigen of the other serovar (no direct conflicts). Note: the empty set (-) is a subset of every set.

The 'minimally congruent' designation is unique to SeroTools and is useful for distinguishing between two scenarios: serovars that differ due to sample misannotation (truly incongruent) and serovars derived from correctly annotated samples whose variation is based solely on missing information. When comparing serovar predictions, minor differences may be expected due to method-specific irregularities, for example, reagent variation for laboratory-based techniques or sequencing read coverage for in silico techniques. Our assumption is that these minor method-specific differences are more likely to manifest as missing data (e.g. all but one of the correct factors were detected) than as direct conflicts.
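The three levels defined so far can be sketched as a single comparison function. This is an illustration under stated assumptions, not the SeroTools implementation: the names `parse` and `compare` are hypothetical, only antigenic formulas with square-bracket annotation are handled (not serovar names), subspecies designations that are both present are assumed to have to be equal, and the subset test compares the required factors of one serovar against all factors (required plus optional) of the other.

```python
import re

_TOKEN = re.compile(r"\[([^\]]*)\]|([^,\[\]]+)")
_SUBSPECIES = {"I", "II", "IIIa", "IIIb", "IV", "V", "VI"}


def parse(designation):
    """Return (subspecies or None, list of (required, optional) factor sets)."""
    text = designation.strip()
    head, _, rest = text.partition(" ")
    subspecies = head if head in _SUBSPECIES else None
    if subspecies:
        text = rest.strip()
    antigens = []
    for field in (text.split(":") if text else []):
        required, optional = set(), set()
        for bracketed, plain in _TOKEN.findall(field):
            if bracketed:
                optional.update(f.strip() for f in bracketed.split(",") if f.strip())
            elif plain.strip() not in ("", "-"):  # "-" is the empty set
                required.add(plain.strip())
        antigens.append((required, optional))
    return subspecies, antigens


def _has_factors(antigens):
    return any(required or optional for required, optional in antigens)


def _is_subset(x, y):
    # Assumption: required factors of x must appear somewhere in y.
    return all(rx <= (ry | oy) for (rx, _), (ry, oy) in zip(x, y))


def compare(a, b):
    """Return 'exact', 'congruent', 'minimally congruent', or None."""
    if a.strip() == b.strip():
        return "exact"
    sub_a, ant_a = parse(a)
    sub_b, ant_b = parse(b)
    if sub_a is not None and sub_a == sub_b and not (_has_factors(ant_a) or _has_factors(ant_b)):
        return "exact"
    # Pad with empty antigens so a missing 'Other H' field compares as empty.
    n = max(len(ant_a), len(ant_b))
    ant_a = ant_a + [(set(), set())] * (n - len(ant_a))
    ant_b = ant_b + [(set(), set())] * (n - len(ant_b))
    if sub_a == sub_b and all(ra == rb for (ra, _), (rb, _) in zip(ant_a, ant_b)):
        return "congruent"
    subspecies_ok = sub_a is None or sub_b is None or sub_a == sub_b
    if subspecies_ok and (_is_subset(ant_a, ant_b) or _is_subset(ant_b, ant_a)):
        return "minimally congruent"
    return None  # none of the three levels sketched here applies
```

For example, under these assumptions `I [1],9,12:e,h:1,5` and `I 9,12:e,h:1,5` differ only by an optional factor and compare as congruent, while `I 9,12:e,h:-` and `I 9,12:e,h:1,5` differ only by missing phase 2 information and compare as minimally congruent.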