Rosalution: Supporting data accessibility, integration, curation, interoperability, and reuse for precision animal modeling

1 Center for Computational Genomics and Data Sciences, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
2 Department of Genetics, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
3 Department of Pediatrics, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
4 Hugh Kaul Precision Medicine Institute, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
5 Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
¶ Corresponding author
* These authors contributed equally.
DOI: 10.21105/joss.05443

applications for these models. Widespread adoption of whole exome and genome sequencing has accelerated the identification of disease genes and rare disease-causing variations. Advanced algorithms and software have streamlined data analysis, helping researchers filter out common genetic variants and prioritize rare, potentially disease-causing ones. CRISPR-Cas9 technology has made precise genetic alterations in animals more accessible. Collectively, these advancements have revolutionized our ability to identify and understand molecular variations underlying human diseases, greatly enhanced our knowledge of biological processes and disease mechanisms, and empowered us to test therapies and drugs using precision animal models.
Selecting the right variants for precision animal models often involves reviewing research publications and, in some cases, employing custom scripts and pipelines to integrate data from various sources. These datasets and tools encompass variant annotation software like ANNOVAR, Variant Effect Predictor (VEP) (McLaren et al., 2016), or SnpEff, which provide crucial information about variant location, functional impact, and potential disease associations. Variant allele frequency and disease association repositories like ClinVar (Landrum et al., 2018), ExAC, and gnomAD offer insights into variant prevalence in the general population and their links to diseases. Damage prediction algorithms like VAAST, PolyPhen-2 (Adzhubei et al., 2010), and SIFT (Ng & Henikoff, 2001) aid in predicting variant pathogenicity based on sequence conservation, functional impact, and population frequencies. Rare disease collaborative databases and platforms are also valuable resources for variant selection and interpretation.
While developing pipelines and tools for precision animal modeling requires significant effort, one notable tool, MARRVEL (Model organism Aggregated Resources for Rare Variant ExpLoration), is designed to assist users in exploring data from these repositories for variant consideration. MARRVEL provides a wealth of curated information about human genes and variants, along with their orthologous genes in seven model organisms. It aids in assessing whether a variant of unknown significance (VUS) in a known disease-causing gene or a variant in a gene of uncertain significance (GUS) might be pathogenic. The tool aggregates data from various sources, including OMIM (Amberger et al., 2015), ExAC/gnomAD, ClinVar (Landrum et al., 2018), Geno2MP, DGV, and DECIPHER, and offers insights into orthologous genes, expression patterns, and Gene Ontology (GO) terms across both human and model organisms.
Although MARRVEL provides extensive, carefully selected, and organized information, it does not currently support direct user curation or annotation. Nevertheless, the ability to curate data, as seen in platforms like Rosalution, offers numerous advantages. It encourages collaboration and knowledge sharing within the scientific community, driving researchers and experts to contribute insights and annotations. User-curated data enables rapid integration of new research findings, maintains standardized data representation, and reduces the likelihood of errors or misinterpretations. As precision medicine and animal modeling continue to advance, user-curated data plays a vital role in staying at the forefront of scientific discovery.

Statement of Need
Gene editing approaches are used to generate precision disease animal models (e.g., cells, worms, zebrafish) carrying patient-derived variants. Understanding the specific cellular and molecular impact of these variants in such model systems supports efforts to derive, diagnose, and provide therapies for ultra-rare diseases. Generating these models can take months to years.
Vetting candidate variants is generally a manual, non-systematic, and inefficient process, performed using different methods and datasets generated or curated by hundreds of cell and molecular biology labs worldwide. Researchers invest many hours gathering data and reviewing candidates using a series of disparate tools. Project tracking, the collected information, and any additional generated data are rarely available in an accessible and structured format. Criteria used in decision-making often need to be better standardized, validated, or tested. Both ingested and generated data and metadata are often incompletely retained and thus lost for reuse.
We developed Rosalution to centralize these collaborative efforts via an accessible website client and application programming interface (API). A design-first approach was selected, focused on creating a seamless experience that guides teams through a collaborative analysis process while keeping functionality and accessibility in mind. Rosalution implements this design as a VueJS single-page application (SPA) web client and a FastAPI Python service that enables programmatic access following the OpenAPI standard and serves interactive API documentation. Rosalution persistently stores its state in a MongoDB NoSQL database.
Rosalution facilitates three aspects of the case review process:
• Augmenting and standardizing case and variant/gene intake and annotation with configurable automated annotation from publicly available data sources

Analysis Intake
Figure 1: A compilation of Rosalution screenshots of an analysis with its supporting evidence attachments.
New cases in Rosalution are uploaded as a JSON file either via the web client user interface (UI) or the web API. The data for analysis is structured programmatically from a predefined template populated with data from the uploaded JSON. Once the new case persists in the database, the API sends an HTTP response noting successful creation. Researchers then begin preparing the case within the Rosalution web client with additional insights and supporting evidence, as seen in Figure 1. In the background, the Rosalution API queues an annotation task for each dataset associated with the case's variants, genes of interest, and clinical data. These annotation tasks are processed in an external thread pool so that they do not block incoming HTTP traffic to Rosalution's API, keeping the application responsive while annotation runs in the background. An abstract Python class defines annotation tasks with an interface for subclasses to implement reading datasets from a specified data source. Once fetched, the data is returned as a Python dictionary, from which values are extracted using the jq Python module and saved in the database.
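The background annotation flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Rosalution's actual code: the names `AnnotationTask`, `StaticSourceTask`, and `queue_annotations` are hypothetical, and an in-memory dictionary with made-up values stands in for a REST API or database.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class AnnotationTask(ABC):
    """Hypothetical base class: one task annotates one dataset for one gene or variant."""

    def __init__(self, unit, dataset):
        self.unit = unit        # e.g. a gene symbol or variant identifier
        self.dataset = dataset  # name of the dataset to annotate

    @abstractmethod
    def fetch(self):
        """Read the raw record for this unit from the task's data source."""


class StaticSourceTask(AnnotationTask):
    """Illustrative subclass; a real one might call a REST API or query a database."""

    SOURCE = {"GENE1": {"gene_id": "ID-0001"}}  # made-up data

    def fetch(self):
        return self.SOURCE.get(self.unit, {})


def queue_annotations(pool, tasks):
    """Submit each task to the pool and return immediately with the futures."""
    return [pool.submit(task.fetch) for task in tasks]


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = queue_annotations(pool, [StaticSourceTask("GENE1", "gene_id")])
    # A web service could answer the HTTP request at this point;
    # the fetched records become available once the futures complete.
    results = [future.result() for future in futures]
```

Because `queue_annotations` only submits work, the caller is not held up by slow data sources, which is the property the thread pool is meant to provide.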
The application provides the jq module with the query to extract the dataset's value as defined in the configuration of the annotation task. This design supports simultaneously annotating from a variety of disparate sources, REST APIs and databases, referenced in Table 1, with future support planned for variant call format (VCF) files, databases, etc.

(Landrum et al., 2018) | Database | Interpreted Conditions and Interpretation
Entrez Gene (Maglott et al., 2007) | Database | Entrez Gene Id
Ensembl Data (Zerbino et al., 2018), Ensembl REST API (Yates et al., 2015), Ensembl VEP (McLaren et al., 2016) | Database, REST API, Tool via REST API | Ensembl Gene Id, Consequences, Impact, Polyphen Prediction and Score (Adzhubei et al., 2010), ClinVar Ids, RefSeq Transcript Id (O'Leary et al., 2016), SIFT Prediction and Score (Ng & Henikoff, 2001), CADD (Rentzsch et al., 2021)
HUGO Gene Nomenclature Committee (HGNC) (Seal et al., 2023) | Database | HGNC Gene Id
Human Phenotype Ontology (HPO) (Köhler et al., 2021) | REST API and Database | Entrez Gene Id, OMIM, Disease Associations, HPO Term Association
Online Mendelian Inheritance in Man (OMIM) (Amberger et al., 2015) | Database | OMIM, Disease Associations
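The query-driven extraction described in this section can be sketched as follows. This is an illustration only: the configuration layout, the dataset names, and the helpers `extract` and `annotate` are assumptions, and the dotted-path resolver is a simplified, dependency-free stand-in for jq, not the jq module itself.

```python
from functools import reduce

# Hypothetical annotation configuration: each dataset names the query that
# locates its value in the fetched record. Rosalution uses jq queries; the
# dotted paths below are a simplified stand-in for them.
ANNOTATION_CONFIG = [
    {"dataset": "gene_id", "query": ".gene.id"},
    {"dataset": "impact", "query": ".consequences.impact"},
]


def extract(record, query):
    """Follow a dotted path such as '.gene.id'; return None if a key is missing."""
    keys = [key for key in query.split(".") if key]
    return reduce(
        lambda node, key: node.get(key) if isinstance(node, dict) else None,
        keys,
        record,
    )


def annotate(record, config):
    """Build the {dataset: value} mapping that would be saved to the database."""
    return {entry["dataset"]: extract(record, entry["query"]) for entry in config}


# Made-up record standing in for data fetched from a source.
fetched = {"gene": {"id": "ID-0001"}, "consequences": {"impact": "HIGH"}}
annotations = annotate(fetched, ANNOTATION_CONFIG)
```

Keeping the query in configuration rather than in code means that supporting a new dataset from an existing source would, under these assumptions, only require a new configuration entry. With the jq Python package installed, a program compiled with `jq.compile(query)` could take the place of `extract`.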

Collaborative Analysis
Within the UI, Rosalution displays a summary of the case (gene, variant, nominator, and unique ID) as a case card. A subset of recent case cards is shown along with a search bar. Selecting a case opens the case record. The web client splits data in the record into two sections. The case section shows clinical and case-specific genetic information, including age, sex, onset, literature evidence, variant data/interpretations, disease and phenotype associations, prior testing, and clinical utility statements.
Genes and variants of interest are presented at the top of the record, as seen in Figure 2. Clicking on either the gene or variant renders its annotations. Variant-specific data, including pathogenicity, allele frequency, impact, druggability, functional associations, and cellular context, are presented. When displaying annotations, the web client queries the web API for a configuration stored in the database that determines how annotations are displayed. By investing in a configurable visual rendering, we can rapidly adjust the data representation based on how users work with the data.
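The idea of a stored configuration driving how annotations are displayed can be sketched as follows. The configuration shape, section names, dataset names, and the `layout` helper are all hypothetical, and the values are made up; Rosalution's web client is written in VueJS, so this Python sketch only illustrates the logic, not the actual implementation.

```python
# Hypothetical rendering configuration, as it might be stored in the database
# and returned by the web API.
RENDER_CONFIG = [
    {"section": "Pathogenicity", "datasets": ["clinvar_interpretation", "cadd"]},
    {"section": "Frequency", "datasets": ["allele_frequency"]},
]


def layout(annotations, config):
    """Arrange annotation values into display sections in the configured order.

    Datasets absent from `annotations` are skipped, and sections that end up
    with no rows are dropped.
    """
    sections = []
    for entry in config:
        rows = [
            (name, annotations[name])
            for name in entry["datasets"]
            if name in annotations
        ]
        if rows:
            sections.append({"section": entry["section"], "rows": rows})
    return sections


# Made-up annotation values for illustration.
view = layout({"cadd": 28.1, "allele_frequency": 0.0001}, RENDER_CONFIG)
```

Under this design, reordering sections or moving a dataset between them is a change to stored configuration rather than to client code.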
A research team member is assigned to review and add any pertinent annotations from the patient records. The application's web client interacts with the web API to persist the changes. Once the case is open and assigned for analysis, additional collaborators can further curate it by adding supporting evidence and files as they review the case prior to the review meeting. Curating is done by attaching hyperlinks to online resources or files supplemented with comments. During review meetings, when the entire assigned team seeks to decide on the nomination, any novel data used to make decisions are attached to the case as part of the review process. This way, all expert curations and important datasets are integrated into a case within a single repository as a compilation of the data and visuals added.

Conclusion
In conclusion, Rosalution is an open source tool for facilitating collaborative analysis for model generation in the rare genetic disease research community. It supports the process of animal modeling from case intake to decision making. Benefits of this platform include (1) more efficient data analysis through aggregation and automated annotation as well as support for both synchronous and asynchronous collaboration, (2) a reduction in errors via a focus on increasing data standardization and reducing knowledge loss by supporting the real-time collection of curations and evidence via a web-based user interface and API, and (3) an increase in data sharing through its focus on the ability to mine data across all records. Rosalution shows potential for growth and scalability as it opens its development to the broader open-source and open-science communities.

Figure 2: A compilation of Rosalution screenshots of an analysis and annotations for the gene and variants of interest.

• Supporting expert curation by clinical and research experts via a web-based interface
• Supporting synchronous and asynchronous collaborative review by interdisciplinary teams via a web-based interface

Table 1: Data Sources and Tools utilized for Gene and Variant Annotations