kiwiPy: Robust, high-volume, messaging for big-data and computational science workflows

In this work we present kiwiPy, a Python library designed to support robust message based communication for high-throughput, big-data, applications while being general enough to be useful wherever high-volumes of messages need to be communicated in a predictable manner. KiwiPy relies on the RabbitMQ protocol, an industry standard message broker, while providing a simple and intuitive interface that can be used in both multithreaded and coroutine based applications. To demonstrate some of kiwiPy's functionality we give examples from AiiDA, a high-throughput simulation platform, where kiwiPy is used as a key component of the workflow engine.


I. SUMMARY
The computational sciences have seen a huge increase in the use of high-throughput, automated, workflows over the course of the last two decades or so. Focusing on just our domain of computational materials science, there have been several large scale initiatives to provide high-quality results from standardised calculations [1][2][3][4][5][6]. Almost all of these repositories are populated using results from high-throughput quantum mechanical calculations that rely on workflow frameworks [7,8], including our own AiiDA 1 [9,10] which powers the Materials Cloud 2 . One of the many challenges for such frameworks is maximising fault-tolerance whilst simultaneously maintaining high-throughput, often across several systems (typically the client launching the tasks, the supercomputer carrying out the computations and the server hosting the database).
On the software level these problems are perhaps best addressed by using messaging brokers that take responsibility for guaranteeing the durability and atomicity of messages and often enable event based communication. Indeed, solutions such as RabbitMQ 3 see widespread adoption in industry, however, adoption in academia has been more limited, with home-made queue data structures, race condition susceptible locks and polling based solutions being commonplace. This is likely due to message brokers typically having complex APIs (which reflect the non-trivial nature of the underlying protocol) as well as the lack of familiarity with event-based programming in general within the community. KiwiPy 4 was designed specifically to address both these issues, by providing a tool that allows building robust event based systems with an interface that is as simple as possible.
KiwiPy provides three main message types to the user, task queues, Remote Procedure Calls (RPCs), and, broadcasts. These are all exposed via one class called the 'Communicator' which can be trivially constructed by providing a URI string pointing to the Rab-bitMQ server. This starkly contrasts with the several steps that are required to establish a connection to perform each of the message types listed above using standard libraries 5 . Furthermore, by default, kiwiPy creates a separate communication thread that the user never sees, allowing them to interact with the communicator using familiar Python syntax, without the need to be familiar with either coroutines or multithreading. This has the additional advantage that kiwiPy will maintain heartbeats (a periodic check to make sure the connection is still alive) with the server whilst the user code can be doing other things. Heartbeats are an essential part of RabbitMQ's fault tolerance whereby two missed checks will automatically trigger the message to be requeued to be picked up by another client.
To demonstrate some of the possible usage scenarios, we briefly outline the way kiwiPy is used in AiiDA. AiiDA, amongst other things, manages the execution of complex workflows made up of processes which may have checkpoints.

A. Task queues
As is common for high-throughput workflow engines, AiiDA maintains a task queue to which processes are submitted (typically from the user's workstation). These tasks are then consumed by multiple daemon processes (which may also be on the user's workstation or remote) and will only be removed from the task queue once they have been acknowledged to be completed by the consumer. The daemon can be gracefully or abruptly shut down and no task will be lost, since the task will simply be requeued by the broker once it notices that the consumer has died. Furthermore, there are no worries of race conditions between multiple daemon processes, since the task queue guarantees to only distribute each task to, at most, one consumer at a time.

B. Remote Procedure Calls
These are used to control live processes. Each process has a unique identifier and can be sent a 'pause', 'play' or 'kill' message, the response to which is optionally sent back to the initiator to indicate success or otherwise.

C. Broadcasts
These currently serve two purposes: sending 'pause', 'play' or 'kill' messages to all processes at once by broadcasting the relevant message and to control the flow between processes. If a parent process is waiting for a child to complete it will be informed of this via a broadcast message by the child saying that its execution has terminated. This enables decoupling as the child need not know about the existence of the parent.
Together these three message types allow AiiDA to implement a highly-decoupled, distributed, yet, reactive system that has proven to be scalable from individual laptops to workstations, driving simulations on highperformance supercomputers with workflows consisting of varying durations, ranging from milliseconds up to multiple days or weeks.
It is our hope that by lowering the barriers to adoption, kiwiPy will bring the benefits of industry grade message brokers to academia and beyond, ultimately making robust scientific software easier to write and maintain.

II. ACKNOWLEDGEMENTS
We would like to thank Giovanni Pizzi, Nicola Marzari, and the AiiDA team for their continuous coordination and development of the project. We also thank Jason Yu for contributing the first version of the documentation. This work is supported by the MARVEL National Centre for Competency in Research funded by the Swiss National Science Foundation (grant agreement ID 51NF40-182892) and the European Materials Modelling Council-CSA funded by the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 723867.