The 2DECOMP&FFT library: an update with new CPU/GPU capabilities


Statement of need
The 2DECOMP&FFT library (Li & Laizet, 2010) was originally designed for CPU hardware and is now used by many research groups worldwide. The library is based on a 2D-pencil decomposition for data distribution on distributed memory systems and is used as the core of many CFD solvers such as Xcompact3d (Bartholomew et al., 2020) and CaNS (Costa, 2018), with excellent strong scaling performance up to hundreds of thousands of CPU cores. 2DECOMP&FFT mainly relies on MPI, and it offers a user-friendly interface that hides the complexity of the communication. Version 2.0.1 of the library also offers a 1D slab decomposition, which is implemented as a special case of the 2D decomposition. Two alternatives are possible:
• Initial slab orientation in the XY plane;
• Initial slab orientation in the XZ plane.
In many configurations the slab decomposition gives some gain in performance with respect to the 2D-pencil decomposition. This is because the data are already local to each process when transposing between the two dimensions of the slab. A simple memory copy between input and output arrays can therefore replace the full MPI communication.
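The reason a slab transposition needs no communication can be checked with a little shape arithmetic. The sketch below is written in Python for brevity and is not library code. It assumes the usual pencil convention (x-pencils split y over p_row and z over p_col, y-pencils split x over p_row and z over p_col, z-pencils split x over p_row and y over p_col) and a generic even split; the helper names `split` and `pencil_shapes` are hypothetical.

```python
def split(n, p):
    """Share n grid points among p ranks as evenly as possible (generic rule)."""
    base, extra = divmod(n, p)
    return [base + (1 if r < extra else 0) for r in range(p)]


def pencil_shapes(nx, ny, nz, p_row, p_col, row, col):
    """Local array shapes of rank (row, col) in each pencil orientation.

    Assumed convention, not taken from the library source:
    x-pencils split y over p_row and z over p_col,
    y-pencils split x over p_row and z over p_col,
    z-pencils split x over p_row and y over p_col.
    """
    return {
        "x": (nx, split(ny, p_row)[row], split(nz, p_col)[col]),
        "y": (split(nx, p_row)[row], ny, split(nz, p_col)[col]),
        "z": (split(nx, p_row)[row], split(ny, p_col)[col], nz),
    }


# Slab case: p_row = 1, so the x- and y-pencils of a rank cover the same block.
slab = pencil_shapes(64, 48, 32, p_row=1, p_col=4, row=0, col=2)
# General 2D case: the two orientations differ, so data must be exchanged.
pencil = pencil_shapes(64, 48, 32, p_row=2, p_col=4, row=0, col=2)

print(slab["x"], slab["y"])      # (64, 48, 8) (64, 48, 8)
print(pencil["x"], pencil["y"])  # (64, 24, 8) (32, 48, 8)
```

With p_row = 1 each rank already holds the full XY plane for its range of z, which is why the x-to-y transposition reduces to a local copy under these assumptions.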
The library also offers a very efficient and flexible interface to perform 3D Fast Fourier Transforms (FFTs) on distributed memory systems. However, 2DECOMP&FFT is mainly designed to perform data management and communication, and the actual computation of the 1D FFTs is delegated to third-party libraries. The supported FFT backends are: FFTW (Frigo & Johnson, 2005), the Intel Math Kernel Library (MKL), and the CUDA FFT (cuFFT), which is used for FFTs on NVIDIA GPUs. A generic FFT backend, based on Glassman's general N Fast Fourier Transform (Ferguson, 1982), is also available to make the library more portable.
While the 2DECOMP&FFT library has been designed with high-order compact schemes in mind, some derivatives may be evaluated using an explicit formulation based on local stencils. For this reason a halo support API is also provided for explicit message passing between neighbouring pencils.
Finally, the library provides infrastructure to perform parallel data I/O using MPI I/O or ADIOS2 (Godoy et al., 2020). The API provides several features such as: writing single or multiple 3D arrays into a file, writing 2D slices of the data, and data compression either via ADIOS2 or by writing reduced precision or resolution with the MPI I/O backend.
The first version of the library was released in 2010 as a tar.gz package, with a Makefile-based build, and could only make use of CPUs. It has not been modified since its release. The new version of the library can now leverage NVIDIA GPUs, modern CPUs, and various compilers (GNU, Intel, NVHPC, Cray). It has CMake capabilities as well as a proper continuous integration framework with automated tests. The new library was designed to be more appealing to the scientific community, and it can now be easily built as an independent library for use by other software.

GPU porting
An initial port of 2DECOMP&FFT to GPUs was performed within the solver AFiD-GPU (Zhu et al., 2018), which was mainly based on CUDA Fortran for some kernels and CUDA-aware MPI for communications. A second library, cuDECOMP (Romero et al., 2022), was directly inspired by 2DECOMP&FFT. It takes full advantage of CUDA and uses NVIDIA's most recent communication libraries, such as the NVIDIA Collective Communication Library (NCCL). cuDECOMP only targets NVIDIA GPUs. The updated 2DECOMP&FFT mainly uses a mix of CUDA Fortran and OpenACC for the GPU porting, together with CUDA-aware MPI and NCCL for the communications. In addition to previous work, the FFT module is ported to GPUs using cuFFT. The next step is to add an OpenMP-based GPU port to support both AMD and Intel GPU hardware.

How to use 2DECOMP&FFT
The 2D Pencil Decomposition API is defined with three Fortran modules which should be used by applications as:
    use decomp_2d_constants
    use decomp_2d_mpi
    use decomp_2d
where decomp_2d_constants defines all the parameters, decomp_2d_mpi introduces all the MPI related interfaces, and decomp_2d contains the main decomposition and transposition APIs. The library is initialised using:
    call decomp_2d_init(nx, ny, nz, p_row, p_col)
where nx, ny, and nz are the spatial dimensions of the problem, to be distributed over a 2D processor grid p_row × p_col. Note that none of the dimensions need to be divisible by p_row or p_col. In the case of p_row=p_col=0, an automatic decomposition is selected among all possible combinations available.
A key element of this library is a set of communication routines that perform the data transpositions. One needs to perform 4 global transpositions to go through all 3 pencil orientations (i.e., one has to go from x-pencils to y-pencils to z-pencils to y-pencils to x-pencils). Correspondingly, the library provides 4 communication subroutines:
    call transpose_x_to_y(var_in, var_out)
    call transpose_y_to_z(var_in, var_out)
    call transpose_z_to_y(var_in, var_out)
    call transpose_y_to_x(var_in, var_out)
The input array var_in and output array var_out are defined by the code using the library and contain distributed data for the correct pencil orientations.
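Two remarks in this section lend themselves to a small illustration: dimensions need not be divisible by p_row or p_col, and p_row=p_col=0 asks the library to choose among the possible processor grids. The Python sketch below is not library code. It assumes a generic "as even as possible" split and simply enumerates the candidate grids. The rule the library actually uses to distribute the remainder, and the criterion it uses to pick a grid, are not described here and are not reproduced. The helper names `split` and `candidate_grids` are hypothetical.

```python
def split(n, p):
    """Share n grid points among p ranks; the first n % p ranks get one extra.

    Generic rule chosen for illustration; the library may place the
    remainder differently.
    """
    base, extra = divmod(n, p)
    return [base + (1 if r < extra else 0) for r in range(p)]


def candidate_grids(nproc):
    """All (p_row, p_col) pairs such that p_row * p_col == nproc."""
    return [(r, nproc // r) for r in range(1, nproc + 1) if nproc % r == 0]


# 17 points over 4 ranks: no rank is left empty and sizes differ by at most 1.
sizes = split(17, 4)
print(sizes)  # [5, 4, 4, 4]

# Grids an automatic selection could consider for 12 MPI ranks.
grids = candidate_grids(12)
print(grids)  # [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
```

The two extreme grids, (1, 12) and (12, 1), correspond to 1D decompositions, which fits the statement that the slab decomposition is a special case of the 2D one.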
Note that the library is written using Fortran's generic interfaces so different data types are supported without user input. That means var_in and var_out above can be either real or complex arrays, the latter being useful for applications involving 3D Fast Fourier Transforms. Finally, before exit, applications should clean up the memory by:
    call decomp_2d_finalize
Detailed information about the decomposition API is available in the library documentation. Several examples detailing the usage of the different library functionalities are also provided.