Sciris: Simplifying scientific software in Python

1 Institute for Disease Modeling, Global Health Division, Bill & Melinda Gates Foundation, Seattle, USA
2 School of Physics, University of Sydney, Sydney, Australia
3 Burnet Institute, Melbourne, Australia
4 CAE USA, Tampa, USA
5 Saffron Software, Bucharest, Romania
6 Melbourne Data Analytics Platform, The University of Melbourne, Melbourne, Australia
7 Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
8 Google, Zürich, Switzerland
¶ Corresponding author
DOI: 10.21105/joss.05076


Statement of need

The landscape of scientific software
With the increasing availability of large volumes of data and computing resources, scientists across multiple fields of research have been able to tackle increasingly complex problems. But to harness these resources, the need for domain-specific software has become much greater. As the complexity of the questions being tackled has increased, so too has the amount of code used to answer them, creating a steep learning curve and a significant burden of code review (Nature Editorial Board, 2018). For some scientists, this increasing reliance on software has created a barrier between themselves and the science they want to do. It is these people (those who want things to "just work" rather than worry about implementation details) who are the primary audience for Sciris. In contrast, people who care a lot about implementation details, such as those who love using type hints, will likely not find Sciris as helpful.
Scientific code workflows (e.g., either a full cycle in the development of a new software library, or the execution of a one-off analysis) typically rely on multiple codebases, including but not limited to: low-level libraries, domain-specific open-source software, and self-developed and/or inherited Swiss-Army-knife toolboxes (whose original developer may or may not be around to pass on undocumented wisdom). Several scientific communities have adopted collaborative, community-driven, open-source software approaches, as in astropy (Robitaille et al., 2013), fmriprep (Esteban et al., 2019), and nextstrain (Hadfield et al., 2018), because of the significant savings in development costs and increases in code quality that they afford. Despite this progress, a large fraction of scientific software development remains a solo adventure (Kerr, 2019).
This leads to a proliferation of tools where resources are largely spent reinventing wheels of variable quality, which jeopardizes the code's minimum requirements of being "re-runnable, repeatable, reproducible, reusable, and replicable" (Benureau & Rougier, 2018). In addition, low-level programming abstractions can make it harder to clarify the science. For instance, one of the reasons PyTorch has become popular in academic and research environments is its success in making models easier to write compared to TensorFlow (Lorica, 2017). The need for libraries that provide "simplifying interfaces" for research applications is reflected in the development of multiple libraries in scientific Python ecosystems that have enabled researchers to focus their time and efforts on solving problems, prototyping solutions, deploying applications, and educating their communities. In addition to PyTorch (offering a simpler alternative to TensorFlow), other examples include seaborn (simplifying/extending Matplotlib) (Waskom, 2021), pingouin (simplifying/extending pandas), and PyVista (simplifying/extending VTK) (Sullivan & Kaszynski, 2019), among many others. Sciris adds to this ecosystem as a "library of the gaps", addressing annoyances that are each too small-scale to need a dedicated library of their own, but common enough that together they add up to a significant coding burden.

Sciris in practice
The name Sciris is a portmanteau of "scientific" and "iris" (a reference to seeing clearly, as well as the Greek word for "rainbow"). We began work on it in 2014, initially to support development of Optima HIV (Kerr et al., 2015, 2020). We repeatedly encountered the same inconveniences while building scientific webapps, and so we began collecting the tools we used to overcome them into a shared library. While Python is considered an easy-to-use language for beginners, the motivation that shaped Sciris' evolution was to further lower the barriers to accessing the numerous supporting libraries we were using. Our investments in Sciris paid off when, in early 2020, its combination of brevity and simplicity proved crucial in enabling the rapid development of the Covasim model of COVID-19 transmission (Kerr et al., 2021). Covasim's relative simplicity and readability, based in large part on its heavy use of Sciris, enabled it to become one of the most widely adopted models of COVID-19, used by students, researchers, and policymakers in over 30 countries.
In addition to Optima HIV and Covasim, Sciris is currently used by many other scientific software tools, such as Optima Nutrition (Pearson et al., 2018) and the Cascade Analysis Tool.
We believe using Sciris can lead to more efficient scientific code production for solo developers and teams alike, including increased longevity of new scientific libraries (Perkel, 2020). Some of the key functional aspects that Sciris provides are: (i) brevity through simple interfaces; (ii) "dejargonification"; (iii) fine-grained exception handling; and (iv) version management. We expand on each of these below, but first provide a vignette that illustrates many of Sciris' features.

Vignette
Compared with a domain-specific language like MATLAB, even relatively simple scientific code in Python can require significant boilerplate. This extra code can obscure the key logic of the scientific question being addressed.
For example, imagine that we wish to sample random numbers from a user-defined function with varying noise levels, save the intermediate calculations, and plot the results. In vanilla Python, each of these operations is somewhat cumbersome. Figure 1 presents two functionally identical scripts; the one written with Sciris is considerably more readable and succinct.
This vignette illustrates many of Sciris' most-used features, including timing, parallelization, feature-rich containers, file saving and loading, and plotting. For the lines of the script that differ, Sciris reduces the number of lines of code required from 33 to 7, a 79% decrease.
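To give a sense of the kind of boilerplate meant here, the following is a minimal standard-library sketch of the timing, saving, and loading steps only. It is not the script from Figure 1: the function noisy_samples, the noise levels, and the file name are hypothetical, chosen purely for illustration.

```python
import gzip
import os
import pickle
import random
import tempfile
import time

def noisy_samples(noise, n=1000, seed=0):
    """Hypothetical user-defined function: n Gaussian samples at a given noise level."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, noise) for _ in range(n)]

# Timing: record the start time by hand
start = time.perf_counter()

# Run the calculation for each noise level, keeping the intermediate results
noise_levels = [0.1, 0.5, 1.0]
results = {}
for noise in noise_levels:
    results[noise] = noisy_samples(noise)

# Timing: compute and report the elapsed time by hand
elapsed = time.perf_counter() - start
print(f'Elapsed: {elapsed:.4f} s')

# Saving: open a compressed file and pickle the results into it
path = os.path.join(tempfile.mkdtemp(), 'results.pkl.gz')
with gzip.open(path, 'wb') as f:
    pickle.dump(results, f)

# Loading: reverse the steps above
with gzip.open(path, 'rb') as f:
    loaded = pickle.load(f)
```

None of these steps is difficult on its own, but each adds several lines that are unrelated to the scientific question.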

Design philosophy
The aim of Sciris is to make common tasks simpler. Sciris includes implementations of heavily used code patterns and abstractions that facilitate the development and deployment of complex domain-specific scientific applications, and helps non-specialist audiences interact with these applications. We note that Sciris "stands on the shoulders of giants", and as such is not intended as a replacement for the libraries it builds on, but rather as an interface that facilitates a more effective and sustainable development process through the following principles:
Brevity through simple interfaces. Sciris packages common patterns requiring multiple lines of code into single, simple functions. With these functions one can succinctly express and execute frequent plotting tasks (e.g., sc.commaticks, sc.dateformatter, sc.plot3d); ensure consistent types, including containers (e.g., sc.toarray, sc.mergedicts, sc.mergelists); or even perform line-by-line performance profiling (sc.profile). Brevity is also achieved by extending the functionality of well-established objects (e.g., OrderedDict via sc.odict) and methods (e.g., isinstance via sc.checktype, which enables the comparison of objects against higher-level types like arraylike), as well as by wrapping useful third-party libraries (e.g., pyyaml via sc.loadyaml). In providing a curated collection of common data science tools, Sciris has similarities to R's tidyverse.
Dejargonification. Sciris aims to use plain function names (e.g., sc.smooth, sc.findnearest, sc.safedivide) so that the resulting code is as scientifically clear and human-readable as possible. Sciris also provides some MATLAB-like functionality, and uses the same names (e.g., sc.tic and sc.toc; sc.boxoff) to minimize the learning curve for scientists who have MATLAB experience.
Fine-grained exception handling. Across many classes and functions, Sciris uses the keyword die, enabling users to set a locally scoped level of strictness in the handling of exceptions. If die=False, Sciris is more forgiving and softly handles exceptions by using its default (opinionated) behavior, such as printing a warning and returning None so users can decide how to proceed. If die=True, it directly raises the corresponding exception and message. This flexibility reduces the need for try-except blocks, which can distract from the code's scientific logic.
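The die pattern can be sketched in plain Python as follows. This is an illustration of the idea only, not Sciris' implementation: the function find_key and the parameter dictionary are hypothetical, and the choice to warn and return None follows the default behavior described above.

```python
import warnings

def find_key(data, key, die=False):
    """Hypothetical lookup illustrating a locally scoped strictness switch."""
    try:
        return data[key]
    except KeyError as err:
        message = f'Key "{key}" not found; available keys: {list(data)}'
        if die:
            raise KeyError(message) from err  # Strict: raise immediately
        warnings.warn(message)  # Forgiving: warn and let the caller decide
        return None

params = {'beta': 0.3, 'gamma': 0.1}

# Forgiving call: a warning is issued and None is returned
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    value = find_key(params, 'delta')

# Strict call: the exception propagates to the caller
try:
    find_key(params, 'delta', die=True)
    raised = False
except KeyError:
    raised = True
```

The caller chooses the level of strictness per call, so exploratory scripts can stay forgiving while library code that must not fail silently can pass die=True.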
Version management. Keeping track of dates, authors, and code versions, plus additional notes or comments, is an essential part of scientific projects. Sciris provides methods to easily save and load metadata to/from figure files, including Git information (sc.savefig, sc.gitinfo, sc.loadmetadata), as well as shortcuts for comparing module versions (sc.compareversions) or requiring them (sc.require).
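To illustrate why a dedicated version-comparison helper is useful, the sketch below compares version strings numerically using only the standard library. The functions parse_version and compare_versions are hypothetical and deliberately simplified (pre-release suffixes are ignored); they are not the Sciris functions named above.

```python
import re

def parse_version(version):
    """Turn '3.10.2' into (3, 10, 2), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break  # Stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)

def compare_versions(v1, v2):
    """Return 1 if v1 > v2, 0 if they are equal, and -1 if v1 < v2."""
    a, b = parse_version(v1), parse_version(v2)
    length = max(len(a), len(b))
    a += (0,) * (length - len(a))  # Pad so that '2.0' equals '2.0.0'
    b += (0,) * (length - len(b))
    return (a > b) - (a < b)

newer = compare_versions('1.10.0', '1.9')  # Numeric comparison: 1.10.0 is newer
naive = '1.10.0' > '1.9'                   # String comparison gets this wrong
```

Here newer is 1 while naive is False, which is the kind of subtle pitfall that a tested one-line helper removes.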

Examples of key features
Here we illustrate a smattering of key features in greater detail; further information on installation and usage can be found at docs.sciris.org. Figure 3 illustrates the functional modules of Sciris. Sciris can be installed from PyPI using pip (pip install sciris).

Parallelization
A frequent hurdle scientists face is parallelization. Sciris provides sc.parallelize, which acts as a shortcut for using multiprocess.Pool(). By default it adjusts the pool size based on the CPUs available, but can also use either a fixed number of CPUs or allocate them dynamically based on load (sc.loadbalancer). This example shows three equivalent ways to iterate over multiple complex arguments:
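As a rough sketch of what three equivalent call styles can look like, the following standard-library example accepts positional tuples, a dictionary of lists, or a list of dictionaries, and dispatches them to a pool of workers. The helpers normalize_calls and run_parallel, and the keyword names iterarg and iterkwargs, are names chosen for this sketch. A thread pool is used as a stand-in so that the example runs anywhere; it is not a process pool, and this is not how sc.parallelize is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x, y):
    """Toy function standing in for an expensive calculation."""
    return x * y

def normalize_calls(iterarg=None, iterkwargs=None):
    """Convert any of the three input styles into a list of (args, kwargs) pairs."""
    if iterarg is not None:  # Style 1: list of positional-argument tuples
        return [(tuple(args), {}) for args in iterarg]
    if isinstance(iterkwargs, dict):  # Style 2: dict of equal-length lists
        keys = list(iterkwargs)
        return [((), dict(zip(keys, vals))) for vals in zip(*iterkwargs.values())]
    if iterkwargs is not None:  # Style 3: list of dicts, one per call
        return [((), dict(kwargs)) for kwargs in iterkwargs]
    raise ValueError('Supply either iterarg or iterkwargs')

def run_parallel(func, iterarg=None, iterkwargs=None, ncpus=None):
    """Hypothetical helper: run func once per argument set, preserving input order."""
    calls = normalize_calls(iterarg, iterkwargs)
    with ThreadPoolExecutor(max_workers=ncpus) as pool:
        futures = [pool.submit(func, *args, **kwargs) for args, kwargs in calls]
        return [future.result() for future in futures]

out1 = run_parallel(f, iterarg=[(1, 2), (2, 3), (3, 4)])
out2 = run_parallel(f, iterkwargs={'x': [1, 2, 3], 'y': [2, 3, 4]})
out3 = run_parallel(f, iterkwargs=[{'x': 1, 'y': 2}, {'x': 2, 'y': 3}, {'x': 3, 'y': 4}])
```

All three calls produce the same list of results; accepting several input shapes lets the caller use whichever one matches how the arguments are already stored.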

ScirisWeb
While a full description of ScirisWeb is beyond the scope of this paper, briefly, it builds on Sciris to enable the rapid development of Python-based webapps, including those powering Covasim and Optima Nutrition. By default, ScirisWeb uses Vue.js and sciris-js for the frontend, Flask as the web framework, Redis for the (optional) database, and Matplotlib/mpld3 for plotting. However, ScirisWeb is completely modular, which means that it could also be used to (for example) link a React frontend to a MySQL database with Plotly figures. This modularity is in contrast to full-stack solutions such as Shiny for Python, Plotly Dash, Streamlit, and Voilà. While these libraries are even easier to use than ScirisWeb (since they do not require any knowledge of JavaScript), they provide limited options for customization or switching between technology stacks. In contrast, ScirisWeb provides the flexibility of a custom-written webapp within the context of an "it just works" framework.

Beyond Sciris
Like seaborn, Sciris aims to "facilitate rapid exploration and prototyping through named functions and opinionated defaults" (Waskom, 2021). Eventually, a time may come when the user's opinions diverge from Sciris' defaults. Since most Sciris functions are standalone, individual functions can be replaced on an as-needed basis. For example, in situations where strictness is an asset (e.g., low-level libraries where an unexpected type is indicative of an error), the added flexibility that Sciris provides (e.g., the type-agnostic sc.toarray) can be a liability. As another example, sc.odict adds small but nonzero overhead to the dict built-in. While in most cases this performance difference is negligible (<500 ms per million set/get operations), for the innermost loops of compute-intensive applications, dict should be used instead. Finally, since Sciris aims for breadth rather than depth, Sciris functions may eventually need to be supplanted by more powerful alternatives. For example, while sc.parallelize provides one-line parallelization on a local machine or single virtual machine, parallelizing across multiple machines requires more powerful libraries such as Dask (Rocklin, 2015), Ray, or Celery.
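The trade-off between convenience and overhead in a dictionary subclass can be explored with a small benchmark. The sketch below defines a hypothetical IndexableDict (an OrderedDict that also accepts integer positions) and times it against the built-in dict. It is not sc.odict, and the timings it prints depend on the machine and Python version, so they should not be read as a measurement of Sciris itself.

```python
import timeit
from collections import OrderedDict

class IndexableDict(OrderedDict):
    """Hypothetical ordered dict that also accepts integer positions as keys."""
    def __getitem__(self, key):
        if isinstance(key, int) and key not in self:
            key = list(self.keys())[key]  # Fall back to positional lookup
        return super().__getitem__(key)

def set_get(container_type, n=100_000):
    """Perform n set operations and n get operations on a fresh container."""
    d = container_type()
    for i in range(n):
        d[f'k{i}'] = i  # Set
        _ = d[f'k{i}']  # Get
    return d

# Convenience: look up by position as well as by key
d = IndexableDict(a=1, b=2)
first, last, by_key = d[0], d[-1], d['b']

# Cost: time the same workload with each container type
t_dict = timeit.timeit(lambda: set_get(dict), number=3)
t_custom = timeit.timeit(lambda: set_get(IndexableDict), number=3)
print(f'dict: {t_dict:.3f} s, IndexableDict: {t_custom:.3f} s')
```

Running such a benchmark on the actual workload is a simple way to decide whether the convenience of a richer container is worth its cost in a given inner loop.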