Hypothesis: A new approach to property-based testing

Property-based testing is a style of testing popularised by the QuickCheck family of libraries, first in Haskell (Claessen & Hughes, 2000) and later in Erlang (Arts, Hughes, Johansson, & Wiger, 2006), which integrates generated test cases into existing software testing workflows: instead of tests that provide examples of a single concrete behaviour, tests specify properties that hold for a wide range of inputs, and the testing library then attempts to generate test cases that refute them. For a general introduction to property-based testing, see MacIver (2019).

Python has a rich and thriving ecosystem of scientific software, and Hypothesis is helpful for ensuring its correctness. Any researcher who tests their software in Python can benefit from these facilities, but it is particularly useful for improving the correctness of the foundational libraries on which the scientific software ecosystem is built. For example, it has found bugs in astropy (Price-Whelan et al., 2018) and numpy (Walt, Colbert, & Varoquaux, 2011). Additionally, Hypothesis is easily extensible, and has a number of third-party extensions for specific research applications. For example, hypothesis-networkx generates graph data structures, and hypothesis-bio generates formats suitable for bioinformatics. As it is used by more researchers, the number of research applications will only increase.

Hypothesis for Software Testing Research
Hypothesis is a powerful platform for software testing research, both because of the wide array of software that can be easily tested with it, and because it has a novel implementation that solves a major difficulty faced by prior software testing research.
Much of software testing research boils down to variants on the following problem: Given some interestingness condition (e.g., that it triggers a bug in some software), how do we generate a "good" test case that satisfies that condition?
Particular sub-problems include finding test cases that satisfy the condition at all, keeping those test cases valid, and presenting them in a human-readable form. Traditionally, property-based testing has adopted random test-case generation to find interesting test cases, followed by test-case reduction (see Regehr et al. (2012), Zeller & Hildebrandt (2002)) to turn them into more human-readable ones, and has required users to manually specify a validity oracle (a predicate that identifies whether an arbitrary test case is valid) to avoid invalid test cases.
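To make the reduction step concrete, here is a minimal sketch of a greedy, delta-debugging-style reducer in the spirit of Zeller & Hildebrandt (2002). The function names are illustrative, not part of any library's API:

```python
# A minimal, illustrative greedy test-case reducer: repeatedly try
# deleting chunks of the input, keeping any smaller variant that still
# satisfies the interestingness condition.  Names here are hypothetical.

def reduce_test_case(test_case, is_interesting):
    current = list(test_case)
    chunk = len(current) // 2
    while chunk > 0:
        i = 0
        while i < len(current):
            # Try the input with `chunk` elements removed at position i.
            candidate = current[:i] + current[i + chunk:]
            if is_interesting(candidate):
                current = candidate  # smaller and still interesting: keep it
            else:
                i += chunk
        chunk //= 2
    return current

# Example: "interesting" inputs are lists containing at least three 7s.
shrunk = reduce_test_case(
    [1, 7, 2, 7, 3, 7, 4, 7, 5], lambda xs: xs.count(7) >= 3
)
# shrunk is now a much smaller list that still contains three 7s.
```

Note that the reducer calls `is_interesting` on every candidate, which is why a validity oracle matters: without one, reduction can wander into inputs that fail for unrelated reasons.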
The chief limitations of this from a user's point of view are:

• Writing correct validity oracles is difficult and annoying.
• Random generation, while often much better than hand-written examples, is not especially good at satisfying difficult properties.
• Writing test-case reducers that work well for your problem domain is a specialised skill that few people have or want to acquire.
The chief limitation from a researcher's point of view is that trying to improve on random generation's ability to find bugs will typically require modification of existing tests to support new ways of generating data, and typically these modifications are significantly more complex than writing the random generator would have been. Users are rarely going to be willing to undertake the work themselves, which leaves researchers in the unfortunate position of having to put in a significant amount of work per project to understand how to test it.
Hypothesis avoids both of these problems by using a single universal representation for test cases. Ensuring that test cases produced from this format are valid is no more difficult than ensuring that randomly generated test cases are valid, and improvements to the generation process can operate solely on this universal representation rather than being adapted to each test.
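As a rough illustration of the idea (a toy sketch, not Hypothesis's actual internals), such a universal representation can be as simple as a byte sequence that every generator decodes deterministically; all the names below are hypothetical:

```python
# Toy sketch: every generated value is derived deterministically from an
# underlying byte sequence, so reduction and mutation can operate on the
# bytes without knowing anything about the individual test.
import random

class Buffer:
    """Replays a fixed byte sequence; draws past the end return 0."""
    def __init__(self, data):
        self.data = data
        self.index = 0

    def draw_byte(self):
        b = self.data[self.index] if self.index < len(self.data) else 0
        self.index += 1
        return b

def draw_int(buf, lo, hi):
    """Interpret two bytes as an integer in [lo, hi]."""
    raw = buf.draw_byte() * 256 + buf.draw_byte()
    return lo + raw % (hi - lo + 1)

def draw_list(buf, lo, hi):
    """A length byte followed by that many integer draws."""
    return [draw_int(buf, lo, hi) for _ in range(buf.draw_byte() % 8)]

# The same generator works on fresh random bytes and on replayed ones:
random_bytes = bytes(random.randrange(256) for _ in range(64))
value = draw_list(Buffer(random_bytes), 0, 100)

# Reduction can work purely on bytes: in this scheme the all-zero
# buffer decodes to the "simplest" value of any generator built on it.
assert draw_list(Buffer(b""), 0, 100) == []
```

Because any byte sequence decodes to *some* valid value, byte-level manipulation never produces an invalid test case, which is the sense in which the validity problem largely disappears.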
Currently Hypothesis uses this format to support two major use cases:

1. It is the basis of its approach to test-case reduction, allowing it to support more powerful test-case reduction than is found in most property-based testing libraries with no user intervention.
2. It supports Targeted Property-Based Testing (Löscher & Sagonas, 2017), which uses a score to guide testing towards a particular goal (e.g., maximising an error term). In the original implementation this would require custom mutation operators per test, but in Hypothesis this mutation is transparent to the user, who need only specify the goal.
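A toy sketch of how targeted testing can sit on top of such a byte-level representation: mutate the underlying bytes and keep any mutant that improves the user-supplied score. This is plain hill climbing under assumed names, not Hypothesis's actual search algorithm:

```python
# Illustrative hill climbing over a byte buffer: the search layer only
# sees bytes and a score, never the test's own data types, so the user
# supplies nothing beyond the scoring function.  Names are hypothetical.
import random

def hill_climb(score_of_bytes, n_bytes=8, steps=500, seed=0):
    rng = random.Random(seed)
    best = bytes(rng.randrange(256) for _ in range(n_bytes))
    best_score = score_of_bytes(best)
    for _ in range(steps):
        # Mutate one byte of the current best buffer.
        i = rng.randrange(n_bytes)
        mutant = best[:i] + bytes([rng.randrange(256)]) + best[i + 1:]
        s = score_of_bytes(mutant)
        if s > best_score:
            best, best_score = mutant, s
    return best, best_score

# The user-facing part only states the goal: maximise the decoded value.
def score(data):
    return int.from_bytes(data[:2], "big")  # a stand-in "generator"

found, achieved = hill_climb(score)
```

The point of the sketch is the division of labour: scoring lives with the test, while mutation lives entirely in the byte-level layer and is reused unchanged across every test.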
The internal format is flexible and contains rich information about the structure of generated test cases, so it is likely future versions of the software will see other features built on top of it, and we hope researchers will use it as a vehicle to explore other interesting possibilities for test-case generation.