RHEA: an open-source Reproducible Hybrid-architecture flow solver Engineered for Academia

The study of complex multiscale flows (Groen et al., 2014), like for example the motion of small-scale turbulent eddies over large aerodynamic structures (Jofre & Doostan, 2022), microconfined high-pressure supercritical fluids for enhanced energy transfer (Bernades & Jofre, 2022), or the hydrodynamic focusing of microorganisms in wall-bounded flows (Palacios et al., 2022), greatly benefits from the combination of interconnected theoretical, computational, and experimental approaches. This manifold methodology provides a robust framework to corroborate the phenomena observed, validate the modeling assumptions utilized, and facilitate the exploration of wider parameter spaces and the extraction of more sophisticated insights. These analyses are typically encompassed within the field of Predictive Science & Engineering (Najm, 2009), which has attracted attention in the Fluid Mechanics community and is expected to grow exponentially as computational studies transition from (mostly) physics simulations to active vectors for scientific discovery and technological innovation with the advent of Exascale computing (Alowayyed et al., 2017). In this regard, the computational flow solver presented here aims at bridging the gap between studying complex multiscale flow problems and utilizing present and future state-of-the-art supercomputing systems in academic environments. The solver is named RHEA, which stands for open-source Reproducible Hybrid-architecture flow solver Engineered for Academia, and is available as an open-source Git repository at https://gitlab.com/ProjectRHEA/flowsolverrhea.

Statement of need
The computational study and optimization of multiscale flow problems, like for example turbulence, requires the utilization of powerful supercomputers. In particular, the computational cost of studying wall turbulence, in terms of the total number of grid points N, by means of direct numerical simulation (DNS) and/or wall-modeled large-eddy simulation (LES) strategies scales with the Reynolds number Re as N ∼ Re^(37/14) and N ∼ Re (Choi & Moin, 2012), respectively. Therefore, flows at high Reynolds numbers require large computational meshes solved on substantial node counts. As a result, the meshes used to discretize the computational domain on which the equations of fluid motion are solved need to be partitioned (distributed) among parallel tasks such that each of these only holds a local portion of the global mesh. The local portion of each task is composed of a set of grid points that it owns, and a set of off-processor grid points (owned by remote processors) connected with their corresponding local grid points. This overlapped mesh partition is used to exchange data among nearest neighbors by means of message passing interface (MPI) (The MPI Forum, 2022) instructions. In addition, present (and future) supercomputers are equipped with accelerators, like for example graphics processing units (GPU), interconnected with the central processing units (CPU) and the main memory at the node level. In this regard, the flow solver RHEA uses OpenACC (The OpenACC Organization, 2022) pragmas to efficiently speed up its main kernels. This computational framework has been designed with the objective of facilitating the portability of the flow solver from small- and mid-size workstations to top-tier supercomputers, while maintaining notable levels of computational performance.
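The scaling laws and the overlapped mesh partition described above can be illustrated with a short sketch. The snippet below is a toy illustration only (not RHEA code): it evaluates the DNS and wall-modeled LES grid-point scalings, and builds a 1-D contiguous partition in which each task stores its owned points plus single-point halos of off-processor points for nearest-neighbor exchange.

```python
# Toy illustration of the grid-point scaling laws and an overlapped
# 1-D mesh partition with single-point halos (not RHEA's actual code).

def dns_points(Re):
    """DNS cost scaling, N ~ Re**(37/14) (Choi & Moin, 2012)."""
    return Re ** (37.0 / 14.0)

def wmles_points(Re):
    """Wall-modeled LES cost scaling, N ~ Re."""
    return float(Re)

# Increasing Re by 10x multiplies the DNS grid count by ~440x,
# but the wall-modeled LES count by only 10x.
growth_dns = dns_points(10_000) / dns_points(1_000)
growth_wmles = wmles_points(10_000) / wmles_points(1_000)

def partition(n_global, n_tasks, halo=1):
    """Split n_global points into contiguous owned blocks; each task also
    stores `halo` off-processor (ghost) points per side, which are the
    points exchanged with nearest neighbors via MPI in a real solver."""
    base, rem = divmod(n_global, n_tasks)
    blocks, start = [], 0
    for rank in range(n_tasks):
        size = base + (1 if rank < rem else 0)
        owned = list(range(start, start + size))
        lo_ghost = list(range(max(start - halo, 0), start))
        hi_ghost = list(range(start + size, min(start + size + halo, n_global)))
        blocks.append((owned, lo_ghost, hi_ghost))
        start += size
    return blocks

blocks = partition(10, 3)  # ranks own 4, 3, and 3 points, plus halos
```

Interior tasks hold ghost copies on both sides; boundary tasks only on the interior side (periodic wrap-around is omitted here for simplicity).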
There are only a handful of full-fledged accelerated open-source flow solvers available in the literature. Some selected examples for compressible flows include (i) popular multi-purpose open-source packages for simulation (Cantwell et al., 2015) and design (Economon et al., 2015), (ii) OpenSBLI (Jacobs et al., 2017), dedicated to the automated derivation of finite-difference solvers for hybrid (CPU-GPU) architectures based on a Python framework, (iii) HTR (Renzo et al., 2020), constructed from the task-based Legion programming paradigm (Stanford University, 2022) and targeted at simulating hypersonic reacting flows, and (iv) STREAmS (Bernardini et al., 2021), resulting from the extension of a CPU-based solver to run on multi-GPU clusters and focused on supersonic ideal-gas flows. In this regard, RHEA has been designed and developed with the aim of filling this void for the academic community, specifically focusing on (i) solver expressivity, (ii) generality of the flow problems that can be studied in terms of thermodynamic and flow regimes, (iii) rapid portability to different architectures, (iv) computational performance, and (v) reproducibility, through uploading to the repository the parameters and input files required to regenerate the data from simulations.

Computational design
RHEA solves the three-dimensional (3-D) compressible equations of fluid motion utilizing nonuniform Cartesian second-order discretizations (Moin, 2010) in combination with kinetic-energy-preserving (Coppola et al., 2019) convection schemes, or different Harten-Lax-van Leer-type (HLL) Riemann solvers (Toro, 2009), and uses explicit Runge-Kutta methods (Gottlieb et al., 2001) for time integration. The set of conservation equations is closed by means of ideal- or real-gas thermodynamics to target, respectively, subcritical and supercritical fluid regimes (Jofre & Urzay, 2021). The solver can be utilized for a wide range of fluid mechanics problems as it allows (i) flexibly overwriting most high-level kernels to set, for example, specific initial conditions and source terms, and (ii) selecting between Dirichlet, Neumann, periodic, and subsonic & supersonic inflow-outflow (Poinsot & Lele, 1992) boundary conditions. RHEA is written in C++ (Stroustrup, 2013) using object-oriented programming, utilizes YAML (The Official YAML Web Site, 2022) and HDF5 (The HDF Group, 2022) for input/output operations, and targets hybrid (CPU-GPU) supercomputing architectures based on a state-of-the-art parallel and scalable MPI (The MPI Forum, 2022) + OpenACC (The OpenACC Organization, 2022) accelerated (managed-memory) computational framework.
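As a flavor of the explicit Runge-Kutta time integrators referenced above, the sketch below implements the classical third-order strong-stability-preserving scheme described by Gottlieb et al. on a scalar model problem. It is a minimal illustration under simplifying assumptions, not RHEA's integrator, which applies the same stages to the full vector of conserved flow variables.

```python
# Third-order strong-stability-preserving Runge-Kutta (SSP-RK3)
# (Gottlieb et al., 2001), shown on the scalar model problem du/dt = -u,
# whose exact solution is u(t) = u0 * exp(-t).

def ssp_rk3_step(u, dt, rhs):
    """Advance u by one SSP-RK3 step for du/dt = rhs(u)."""
    u1 = u + dt * rhs(u)                                # first stage
    u2 = 0.75 * u + 0.25 * (u1 + dt * rhs(u1))          # second stage
    return u / 3.0 + 2.0 / 3.0 * (u2 + dt * rhs(u2))    # third stage

rhs = lambda u: -u
u, dt = 1.0, 0.01
for _ in range(100):        # integrate from t = 0 to t = 1
    u = ssp_rk3_step(u, dt, rhs)
```

Each stage is a convex combination of forward-Euler updates, which is what gives the scheme its strong-stability (non-oscillatory) property for shock-capturing discretizations.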
The computational performance of RHEA is analyzed by carrying out performance and scalability tests based on 100 time iterations of the 3-D turbulent channel flow configuration described in the next section. In this regard, to assess the portability of RHEA to different computing systems, the tests have been performed on (i) a local hybrid machine (results referred to as Hybrid) and (ii) the Barcelona Supercomputing Center (Barcelona Supercomputing Center, 2022) (results indicated as BSC), without performing any particular tuning of the solver. The Hybrid machine is composed of a node with 1 AMD Ryzen 9 3900XT 12-core CPU and 2 NVIDIA Quadro RTX 4000 GPUs, while the BSC supercomputer contains a CPU-GPU cluster with 3 racks formed of 54 IBM POWER9 nodes, each containing 2 Witherspoon CPUs (20 cores of 3.1 GHz each), 4 NVIDIA Volta GPUs (16 GB each), and 6.4 TB of NVMe (Non-Volatile Memory). The numbers of grid points utilized for the strong (total quantity) and weak (quantity per node) scalability tests correspond to 256 × 256 × 256 (Hybrid) and 384 × 384 × 384 (BSC). Three main observations can be inferred from the results. First, focusing on the time-to-solution when using CPUs+GPUs with respect to only CPUs, the solver is accelerated approximately 2.5× and 5× on the Hybrid and BSC computers, respectively. Second, as shown in Figure 1 (left), on BSC for a fixed problem size, and noting that on CPUs+GPUs the solver is roughly 5× faster, the solver presents strong scalability speedups above 80% and 60%, respectively, when running on CPUs and CPUs+GPUs up to 32 nodes (640 cores and 128 GPUs). Third, maintaining the same ratio of problem size to number of nodes, RHEA is able to preserve weak scalability efficiency above 90% on CPUs and CPUs+GPUs up to 32 nodes on BSC, as depicted in Figure 1 (right).

Application example
The ability of RHEA to configure a wide range of computational flow problems and run them efficiently on powerful supercomputing systems is demonstrated by simulating the canonical 3-D turbulent channel flow problem (Smits et al., 2011) on 2 nodes of the CTE-POWER9 cluster of the Barcelona Supercomputing Center (Barcelona Supercomputing Center, 2022). The friction Reynolds number selected is Re_τ = u_τ δ/ν = 180, where u_τ = 1 m/s is the friction velocity, δ = 1 m is the channel half-height, and ν = μ/ρ is the kinematic viscosity of the fluid, with μ the dynamic viscosity and ρ = 1 kg/m³ the density. The Prandtl number of the problem is Pr = μ c_p/κ = 0.71, with c_p the isobaric specific heat capacity and κ the thermal conductivity, and the Mach number is Ma = u_b/√(γ P_b/ρ) = 0.3, with u_b and P_b the bulk velocity and pressure, respectively, and γ = 1.4 the heat capacity ratio. The mass flow rate in the streamwise direction is imposed through a body force equal to f = [τ_w/δ, 0, 0]⊺, where τ_w is the wall shear stress.
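The nondimensional groups above fix the dimensional fluid properties. As a back-of-the-envelope check (not solver input), the short computation below recovers the kinematic viscosity and the streamwise body force from the stated values.

```python
# Recover dimensional parameters of the Re_tau = 180 channel from the
# stated nondimensional groups (back-of-the-envelope check only).

u_tau = 1.0      # friction velocity [m/s]
delta = 1.0      # channel half-height [m]
rho = 1.0        # density [kg/m^3]
Re_tau = 180.0   # friction Reynolds number

nu = u_tau * delta / Re_tau   # kinematic viscosity, ~5.56e-3 m^2/s
mu = rho * nu                 # dynamic viscosity [Pa s]
tau_w = rho * u_tau ** 2      # wall shear stress, tau_w = rho * u_tau^2
f_x = tau_w / delta           # streamwise body force, f = [tau_w/delta, 0, 0]
```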
The computational domain is 4πδ × 2δ × 4πδ/3 in the streamwise (x), wall-normal (y), and spanwise (z) directions, respectively. The streamwise and spanwise boundaries are set periodic, and no-slip conditions are imposed on the horizontal boundaries (x-z planes). The KGP convection scheme (Coppola et al., 2019) is utilized for solving the transport equations together with a third-order Runge-Kutta time-integration method. The grid is uniform in the streamwise and spanwise directions with resolutions in wall units equal to Δx⁺ = 9 and Δz⁺ = 6, and stretched toward the walls in the vertical direction, with the first grid point at y⁺ = y u_τ/ν = 0.1 and with sizes in the range 0.2 ≲ Δy⁺ ≲ 4. This grid arrangement corresponds to a DNS of size 256 × 128 × 128 grid points. The simulation strategy starts from a linear velocity profile with random fluctuations (Nelson & Fringer, 2017), which is advanced in time utilizing the KGP convection scheme (Coppola et al., 2019) with CFL = 0.9 to reach turbulent steady-state conditions after approximately 20 flow-through-time (FTT) units; based on the bulk velocity u_b and the length of the channel L_x = 4πδ, a FTT is defined as t_FTT = L_x/u_b ∼ δ/u_τ. Flow statistics are collected for roughly 30 FTTs once steady-state conditions are achieved, and compared against reference results (Moser et al., 1999). The time-averaged mean streamwise velocity u⁺ and root-mean-squared (rms) velocity fluctuations u⁺_rms, v⁺_rms, w⁺_rms along the wall-normal direction y⁺ in wall units provided by RHEA and compared to the reference results (Moser et al., 1999) are depicted in Figure 2.
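The wall-unit resolutions quoted above follow directly from the domain size, the grid counts, and the friction Reynolds number, since Δx⁺ = Re_τ Δx/δ for a uniform spacing Δx. The snippet below verifies them; it is simple arithmetic, not solver code.

```python
# Verify the quoted wall-unit resolutions of the 256 x 128 x 128 DNS:
# for a uniform spacing, Delta_x+ = Re_tau * (Lx / Nx) / delta.
import math

Re_tau = 180.0
delta = 1.0
Lx = 4.0 * math.pi * delta          # streamwise domain length
Lz = 4.0 * math.pi * delta / 3.0    # spanwise domain length
Nx, Nz = 256, 128                   # grid points in x and z

dx_plus = Re_tau * (Lx / Nx) / delta   # ~8.8, quoted as Delta_x+ = 9
dz_plus = Re_tau * (Lz / Nz) / delta   # ~5.9, quoted as Delta_z+ = 6
```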
As shown in the figure, the results from RHEA accurately reproduce the first- and second-order flow statistics by (i) properly capturing the inner (viscous sublayer, buffer layer, and log-law region) and outer layers, and (ii) resolving the flow fluctuations peaking around y⁺ ≈ 15 for the streamwise velocity. Additionally, a snapshot of the instantaneous streamwise velocity in wall units u⁺ on an x⁺-y⁺ slice is displayed in Figure 3 to provide qualitative information on the wall-bounded turbulence computed by RHEA.
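First- and second-order statistics such as those in Figure 2 are obtained by averaging over the statistically homogeneous (streamwise and spanwise) directions, and additionally over time. The sketch below computes mean and rms profiles along the wall-normal direction from a synthetic velocity field; it is an illustration of the averaging procedure only, not RHEA's statistics module.

```python
# Mean and root-mean-squared (rms) velocity profiles along the
# wall-normal (y) direction, averaging over the homogeneous x and z
# directions. Synthetic data for illustration; a real run additionally
# averages over time once steady-state conditions are reached.
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 16, 8, 16
# Synthetic field: uniform mean of 10 with Gaussian fluctuations of std 0.5.
u = 10.0 + rng.normal(scale=0.5, size=(nx, ny, nz))

u_mean = u.mean(axis=(0, 2))                           # <u>(y), shape (ny,)
u_rms = (u - u_mean[None, :, None]).std(axis=(0, 2))   # u_rms(y), shape (ny,)
```

Dividing `u_mean` by the friction velocity and the wall distance by ν/u_τ would express these profiles in the wall units (u⁺, y⁺) used in Figure 2.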

Figure 1: Strong scalability speedup (left) and weak scalability efficiency (right) of RHEA on BSC supercomputer using CPUs and CPUs+GPUs.

Figure 2: Comparison against reference results of the time-averaged streamwise velocity u⁺ (left) and velocity fluctuations u⁺_rms, v⁺_rms, and w⁺_rms (right) along the wall-normal direction y⁺ in wall units.

Figure 3: Snapshot of the instantaneous streamwise velocity u⁺ on an x⁺-y⁺ slice from RHEA.