Melissa: coordinating large-scale ensemble runs for deep learning and sensitivity analyses



Summary
Melissa is a file-avoiding, fault-tolerant, and elastic framework, generalized to perform ensemble runs such as large-scale sensitivity analysis and large-scale deep surrogate training on supercomputers. Some of the largest Melissa studies so far employed up to 30k cores to execute 80k parallel simulations while avoiding up to 288 TB of intermediate data storage (Ribés et al., 2022). These large-scale studies avoid intermediate file storage thanks to Melissa's "online" (also referred to as in-transit or on-the-fly) data handling approach. As shown in Fig. 1, Melissa's architecture relies on three interacting components: the client, the server, and the launcher.
1. Melissa client: the parallel numerical simulation code turned into a client. Each client sends its output to the server as soon as it is available. Clients are independent jobs.
2. Melissa server: a parallelized process in charge of processing the data upon arrival from the distributed and parallelized clients (e.g., computing statistics or training a neural network).
3. Melissa launcher: the front-end Python script in charge of orchestrating the execution of the study. This piece of code interacts directly with OpenMPI or with the cluster scheduler (e.g., Slurm or OAR) to submit all instances and monitor their proper execution.
The Melissa server component is designed to be specialized for various types of ensemble runs:

Sensitivity Analysis (melissa-sa)
Melissa's sensitivity analysis server is built around two key concepts: iterative (sometimes also called incremental) statistics algorithms and an asynchronous client/server model for data transfer. Simulation outputs are never stored on disk. Instead, they are sent via NxM communication patterns from the simulations to a parallelized server (Fig. 1). This method of data aggregation allows statistical fields to be computed rapidly and iteratively, updating the statistics as each output arrives. Avoiding disk storage opens up the ability to compute oblivious statistical maps for all mesh elements, for every time step, and on a full-resolution study. Melissa comes with iterative algorithms for computing various statistical quantities (e.g., mean, variance, skewness, kurtosis, and Sobol indices) and can easily be extended with new algorithms.
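To illustrate what an iterative statistic looks like, the following is a minimal sketch of Welford's running mean and variance, applied independently to each mesh element. It is not Melissa's implementation: the class name `IterativeMoments` and its methods are hypothetical and chosen only for this example. The point is that each incoming field is folded into the running statistics and can then be discarded.

```python
class IterativeMoments:
    """Running mean and variance per mesh element (Welford's algorithm).

    Hypothetical illustration; not part of the Melissa code base.
    """

    def __init__(self, n_cells):
        self.count = 0
        self.mean = [0.0] * n_cells
        self.m2 = [0.0] * n_cells  # running sum of squared deviations

    def update(self, field):
        """Fold one simulation output into the statistics."""
        self.count += 1
        for i, x in enumerate(field):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        """Unbiased sample variance; needs at least two samples."""
        if self.count < 2:
            raise ValueError("variance needs at least two samples")
        return [m / (self.count - 1) for m in self.m2]


stats = IterativeMoments(n_cells=3)
# Each field stands for the output of one ensemble member at one time step.
for field in ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]):
    stats.update(field)  # the field is not kept after this call
```

The memory footprint depends only on the number of mesh elements, not on the number of simulations, which is what makes the file-avoiding approach possible for these quantities.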

Deep Surrogate Training (melissa-dl)
Melissa's deep learning server adopts a similar philosophy. Clients communicate data in a round-robin fashion to the parallelized server (Fig. 1). The multi-threaded server then puts data samples into a buffer and pulls them out again (Fig. 2) to build training batches. Melissa can perform distributed data-parallel training on several GPUs, associating a buffer with each of them. To ensure proper memory management during execution, samples are selected and evicted according to a predefined policy. This strategy enables the online training method shown in Fig. 2. Furthermore, the Melissa architecture is designed to accommodate popular deep learning libraries such as PyTorch and TensorFlow.
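The buffer idea can be sketched as follows. This is a simplified, single-threaded illustration under assumed behavior, not Melissa's actual buffer: the class `SampleBuffer`, its capacity rule, and its "random draw, evict on read" policy are hypothetical choices made for the example.

```python
import random


class SampleBuffer:
    """Bounded buffer: samples enter in arrival order and leave at random.

    Hypothetical illustration of a selection/eviction policy.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.rng = random.Random(seed)

    def put(self, sample):
        """Store a sample; return False if the buffer is full."""
        if len(self.samples) >= self.capacity:
            return False
        self.samples.append(sample)
        return True

    def get_batch(self, batch_size):
        """Draw up to batch_size samples at random and evict them."""
        batch = []
        for _ in range(min(batch_size, len(self.samples))):
            idx = self.rng.randrange(len(self.samples))
            # Swap the chosen sample to the end so removal is O(1).
            self.samples[idx], self.samples[-1] = (
                self.samples[-1],
                self.samples[idx],
            )
            batch.append(self.samples.pop())
        return batch


buffer = SampleBuffer(capacity=4)
accepted = [buffer.put(step) for step in range(6)]  # last two are refused
batch = buffer.get_batch(batch_size=3)
```

Bounding the capacity keeps memory use fixed regardless of how fast clients produce data, and drawing at random reduces the correlation between consecutive samples that arrive from the same simulation.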

State of the field
Melissa is unique in many ways, but there is a group of other open-source codes aiming to help scientists manage large-scale analyses on supercomputers. For example, Merlin (Merlin, 2022) and Radical Pilot (Merzky et al., 2021) are supercomputing tools designed to help reduce friction in large-scale ensemble runs that depend on file system I/O. Meanwhile, frameworks such as Ray (Moritz et al., 2017) and Dask (Dask Development Team, 2016) distribute Python processes across clusters, but they do not support MPI-based applications and are not file-avoiding. Finally, in-situ processing tools such as DataSpaces (Docan et al., 2010), Decaf (Yildiz et al., 2022), and Damaris (Dorier et al., 2012) do not support ensemble runs. Although all these software packages are useful for particular applications, none of them fulfills all three main tasks Melissa was built for: large-scale data generation, scheduler handling, and file-avoiding data processing.

Using Melissa

Installing Melissa
Melissa includes online documentation geared toward new and advanced users alike. For example, installation instructions help users get started no matter which supercomputer they are working on. The typical installation is done via a cmake command. However, a spack install is also available.

Configuring Melissa
As highlighted in the documentation, running a Melissa analysis requires the user to:
1. Instrument the simulation code with the Melissa API (3 base calls: init, send, and finalize) so it can become a Melissa client.
• Typically, the calls to melissa_send() are performed inside the simulation loop. For example, each time step of a physical simulation may contain a melissa_send() call that sends the physical quantities associated with the domain at that time step. This is the data that the Melissa server collects and analyzes in an online fashion (iterative statistics or online training).
• As of now, Melissa provides an API compatible with solvers developed in the most popular HPC languages: C, Fortran, and Python.
2. Configure the analysis. This includes defining the design of experiment (i.e., how to draw the parameters for each simulation execution), selecting which statistics to compute, or, in the case of deep surrogate training, specifying the neural network architecture, the training algorithm, and its parameters.
• The Melissa interface comprises two components: the configuration file (config.json) and the custom user class (custom_server.py). The configuration file is a JSON dictionary that contains all the study controls (e.g., number of clients to launch, which statistics to compute, batch_size, etc.). config.json also contains instructions on how to execute the instrumented solver as well as all the custom launcher controls for the user's specific scheduler. Meanwhile, custom_server.py is where a user customizes the machinery inside Melissa. For example, custom_server.py may include specific deep-learning training loops/network architectures, custom iterative statistics, pre- and post-processing steps for the data, intermediate logging, etc.
3. Start the Melissa launcher on the terminal or on the front-end of the supercomputer. Melissa takes care of requesting resources to execute the server and clients, monitoring the execution, and restarting failing components when necessary.
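The instrumentation described in step 1 can be sketched with a toy solver. The three functions below are hypothetical stand-ins that only record what would be sent; they mirror the init/send/finalize pattern named above, but their names, arguments, and behavior are assumptions for illustration, and the real Melissa API signatures may differ (consult the Melissa documentation). The solver is a toy 1D explicit heat diffusion, not one of Melissa's shipped examples.

```python
sent = []  # stand-in for the transport to the server (assumption)


def melissa_init(field_name, field_size):
    """Hypothetical stand-in: announce the field the client will send."""
    sent.append(("init", field_name, field_size))


def melissa_send(field_name, values):
    """Hypothetical stand-in: ship one time step to the server."""
    sent.append(("send", field_name, list(values)))


def melissa_finalize():
    """Hypothetical stand-in: tell the server this client is done."""
    sent.append(("finalize",))


def run_simulation(initial_temperature, n_steps, alpha=0.25):
    """Toy 1D explicit heat diffusion with fixed boundary values."""
    u = list(initial_temperature)
    melissa_init("temperature", len(u))
    for _ in range(n_steps):
        interior = [
            u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
            for i in range(1, len(u) - 1)
        ]
        u = [u[0]] + interior + [u[-1]]
        melissa_send("temperature", u)  # one call per time step
    melissa_finalize()
    return u


final = run_simulation([0.0, 0.0, 4.0, 0.0, 0.0], n_steps=2)
```

The solver itself is unchanged apart from the three calls: one before the time loop, one inside it, and one after it. Nothing is written to disk between time steps.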

Running Melissa
After the user has instrumented their simulation code and configured their custom server, the study is launched with a single command:
    melissa-launcher -c config.json

Monitoring Melissa
Melissa also contains a variety of monitoring/logging features to help users track live studies and post-process completed studies. One feature is called the melissa monitor, which is designed to run in terminals directly on supercomputers. This feature displays the number of waiting, running, terminated, and failed jobs. Meanwhile, for deep-learning studies, Melissa has TensorBoard integration, which allows users to track the training loss and other custom metrics in real time.

Melissa test suite and CI
The Melissa source code includes a CI pipeline that builds the source, builds and publishes the documentation, runs unit tests, and runs full integration tests. This CI serves to maintain code quality while development advances in an open-source fashion among a group of developers.

Examples and exhibits
Melissa has already been successfully coupled with state-of-the-art PDE solvers (e.g., Code-Saturne, FEniCS), and the source code provides ready-to-use examples for the heat equation and the Lorenz system. These examples include training deep-learning surrogates on distributed GPUs and computing iterative statistics. Further, Melissa includes a fully reproducible comparison of online and offline deep learning. Finally, users seeking active support are encouraged to join our Discourse forum and ask questions to the development team.

Figure 1: Melissa architecture. Specificities of sensitivity analysis and deep learning applications appear side by side.