Implementation of Peridynamics utilizing HPX -- the C++ standard library for parallelism and concurrency

Peridynamics is a non-local generalization of continuum mechanics tailored to address discontinuous displacement fields arising in fracture mechanics. As many non-local approaches, peridynamics requires considerable computing resources to solve practical problems. Several implementations of peridynamics utilizing CUDA, OpenCL, and MPI were developed to address this important issue. On modern supercomputers, asynchronous many task systems are emerging to address the new architecture of computational nodes. This paper presents a peridynamics EMU nodal discretization implementation with the C++ Standard Library for Concurrency and Parallelism (HPX), an open source asynchronous many task run time system. The code is designed for modular expandability, so as to simplify it to extend with new material models or discretizations. The code is convergent for implicit time integration and recovers theoretical solutions. Explicit time integration, convergence results are presented to showcase the agreement of results with theoretical claims in previous works. Two benchmark tests on code scalability are applied demonstrating agreement between this code's scalability and theoretical estimations.

HPX is an open source C++ standard library a asynchronous many task run time system that focuses on high performance computing [10]. HPX provides wait-free asynchronous execution and futurization for synchronization. It also features parallel execution policies utilizing a task scheduler, which enables a fine-grained and parallelization and synchronization due to work stealing. HPX is in strict adherence to the C++ 11 and C++ 17 standard definitions. This paper presents two peridynamics models discretized with the EMU nodal discretization making use of the features of AMT within HPX. We show in this paper how to take advantage of the fine-grain parallelism arising on modern supercomputers. In addition, the scalability of the algorithm is compared to its theoretical complexity and the promised work stealing of HPX is addressed. The implementation is validated against results from classical continuum mechanics and numerical analysis.
The paper is structured as follows. Section 2 reviews peridynamics models, the EMU-ND discretization, the time integration, convergence rates for bond-based material models, and the concepts of HPX. Section 3 describes the proposed modular design and the implementation within the concepts of HPX. The parallel implementation is validated against classical continuum mechanics and numerical analysis in Section 4. The actual computational time for the HPX peridynamics implementation is benchmarked against the theoretical computation time in Section 5. Section 6 concludes this work.

Peridynamics theory
Let a material domain be Ω 0 ⊂ R d , for d = 1, 2, and 3. Peridynamics (PD) [30,26] assumes that every material point X ∈ Ω 0 interacts non-locally with all other material points inside a horizon of length δ > 0, as illustrated in Figure 1. Let B δ (X) be the ball of radius δ centered at X. When Ω 0 is submitted to mechanical loads, the material point X assumes position x(t, X) = X+u(t, X), where u(t, X) is the displacement of material point X at time t. The vector η := u(t, X ) − u(t, X) is called the bond-deformation vector and ξ := X − X denotes the initial bond vector.
Let f (t, u(t, X ) − u(t, X), X − X) denote the peridynamic force as a function of time t, bonddeformation vector u(t, X)−u(t, X), and reference bond vector X−X . The peridynamics equation f (t, u(t, X ) − u(t, X), X − X)dX + b(t, X), (2.1) where b is the external force density, and (X) is the material's mass density. Equation (2.1) is complemented by initial conditions u(0, X) = u 0 (X) andu(0, X) = v 0 (X). In contrast to local problems, boundary conditions are non-local and are specified on a boundary layer or collar Ω c that surrounds the domain Ω 0 . The non-local boundary conditions are detailed in the next section. The formulation given above depends on two-point interactions and is referred to as a bondbased peridynamics model. Bond based models allow for only two point non-local interactions and because of this the material's Poisson ratio is constrained to 0.25 [17,18]. On the other hand, state-based peridynamics models [30] allow for multi-point non-local interactions and overcomes the restriction on the Poisson's ratio. It is conveniently formulated in terms of displacement dependent tensor valued functions. Let T [t, X] be the peridynamic state at time t and material point X. The peridynamics equation of motion for a state-based model is given by The description of initial boundary value problem is completed by the non-local boundary conditions described in the next section.

Dirichlet formulation
We introduce a boundary layer Ω c of thickness δ, referred to as a collar, surrounding Ω 0 , ( see Figure 2). Let Ω 0 ∪ Ω c define the complete material domain and we prescribe a displacement field on Ω c . These are the non-local Dirichlet conditions. With the displacement field for X ∈ Ω c specified, we then solve for the displacement in X ∈ Ω 0 using the peridynamic equation of motion.
The peridynamic equation of motion for X ∈ Ω 0 reads for a bond-based model. Similarly, the equation of motion for the state-based model is formulated for X ∈ Ω 0 and given by

Neumann formulation
The non-local Neumann condition is given in terms of a body force field localized to the collar Ω c and we form X ∈Ω = Ω c ∪ Ω 0 . We seek a solution of the balance of momentum equation for X inΩ, see [26]. The peridynamic equation of motion for X ∈Ω reads for a bond-based model. Similarly, the equation of motion for the state-based model is given by

Bond-based nonlinear peridynamics model
Bond based material forces are often formulated as initially elastic and then failing abruptly; this is known as the prototypical microelastic bond model (PMB) [27,29,25]. We use a smooth version of this type of model based on a nonlinear potential (NP) where the bond force depends on the tensile strain between two points [19,20]. The bond force is initially elastic, after reaching a critical strain, softens to zero. The peridynamic force f for the NP material model is given by Here, |B δ (X)| denotes the volume of a sphere of radius δ in dimension d = 1, 2, 3. The bond strain S between two material points X and X is given by where · denotes the dot product of the two vectors. The potential energy density W δ is described by where J δ (|X − X|) is the influence function that is zero for |X − X| ≥ δ. The nonlinear function ψ(r), where r = |X − X|S 2 , is the potential function which is assumed to be smooth, positive and concave, with  , plots the bond-force as a function of a bond-strain for the prototypical micro-elastic bond-based model [25,27,29]. For a PMB material, the bond-force drops to zero once the strain S exceeds the critical value S c . Figure 3b shows the force as a function of bond strain for the NP material characterized by the potential ψ(r) = 1 − exp(−r). The figure shows that the NP material softens after a critical strain and the force smoothly decays to zero with the increasing strain.
The energy density at the material point X is given by and is a quantity to obtain stable time steps.

State-based linear peridynamics model
The state T for a specific linear material can be given by [30] T where K and µ are respectively the material's bulk and shear modulus. The weighted volume m X yields and the dilatation is given by 3 m X |ξ|edX with e = |ξ + η| − |ξ|. (2.17)

Discretization of peridynamics equations
Continuous and discontinuous Galerkin finite element methods [2], Gauss quadrature [32] and spatial discretization [5,25] are different discretization approaches for PD. Owing its to efficient load distribution scheme, the EMU nodal discretization (EMU ND) [25] is chosen for this implementation.
In the EMU ND scheme, every material point X is placed at the nodes x := {x i ∈ R 3 : i = 1, . . . , n}, (e.g. in a regular grid in the reference configuration Ω 0 , as schematized in Figure 4). The discrete nodal spacing h between x j and x i is given by In addition, all nodes x are associated with a surrounding volume V := {V i ∈ R : i = 1, . . . , n}. These volumes V recover the reference configuration volume as per The discrete bond-based equation of motion yields, for all x i ∈ Ω 0 , and the discrete state-based equation of motion yields, for all x i ∈ Ω 0 , These equations are discretized in space and need additional time integration schemes.

Time integration
When a external force b(t, x i ) is applied as a load, the system of equations is not in equilibrium and must be solved for the displacement vector u. This nonlinear system of equations can be cast under where K is the tangent stiffness matrix and r the residual vector. This system can be solved using Newton's method. One approach is to compute the entries of the tangent stiffness matrix with respect to the displacement is to use a central difference scheme [22] as The residual vector is given by and |r| is the l 2 norm of the residual vector r.

Explicit time integration scheme for dynamic problems
The the central difference scheme can be used to evolve discrete dynamic problems and the evolution is given by The velocity-verlet scheme is also used and the evolution is given by

Convergence rate for the nonlinear bond-based model
Suppose that there exists an exact solution to the peridynamic equation (2.1). Define the error as the difference between the exact and the approximate solutions. For a function f : Ω → R d , the L 2 (Ω, R d ) norm is given by We are interested in the L 2 norm of the error and how the L 2 norm of the error behaves as we change the discrete nodal spacing h. The error estimate for the nonlinear bond-based model considered in this work was developed in [13,14]. For exact and differentiable solutions, the L 2 norm of the error has been shown to converge to zero at a linear rate in both the spatial discretization h and time step ∆t for a fixed horizon. Consider a fixed horizon δ and three mesh sizes Let u h1 , u h2 , u h3 be three solutions corresponding to the three mesh sizes. Assume that the L 2 norm of the error behaves as ||u h − u exact || L 2 ≤ Ch α . Theoretically, we expect α = 1, however, α has to be computed from numerical results to validate the theoretical findings. We use H P X Figure 5: Example dependency graph.
where we used triangle inequality for L 2 norm in first step. Taking the logarithm on both sides yield α can be computed from numerical solutions and its value compared with the theoretically expected value of 1.

HPX -an open source C++ standard library for parallelism and concurrency
The HPX library [10] is a C++ standard compliant Asynchronous Many Task (AMT) run time system tailored for HPC applications. It provides well-known concepts such as data flow, futurization, and Continuation Passing Style (CPS), as well as new and unique overarching concepts. The combination of these concepts results in a unique model. The concept of futurization and parallel for loops, which are used to synchronize and parallelize, are recalled here.

Futurization
An important concept for synchronization provided by HPX is futurization. The API provides an asynchronous return type hpx::lcos::future<T>. This return type, a so-called future, is based on templates and hides the function call return type. The difference here is that the called function immediately returns, even if the return value is not computed. The return value of a future is accessible through the .get() operator that is an synchronous call and waits until the return type is computed. The standard-conform API functions hpx::wait_all, hpx::wait_any, and hpx::lcos:: future<T>::then are available for combining futures and generate parallel executions graphs [15].
A typical example dependency graph is shown in Figure 5. On the Figure, H depends asynchronously on P and X. Listing 1 provides these dependencies resolutions within HPX. First, a std::vector is utilized to collect the dependencies. Second, the computations futures are added to this vector. Note, that the compute function returns a future in both cases, immediately, and the computations are executed asynchronously. Third, a barrier with hpx::wait_all for the dependencies has to be defined before H can be computed. HPX internally ensures that the function in line 6 is only called when the the previous two futures computations are finished.
Listing 2: C++ code for storing the sum of two vectors sequentially in third vector.

Parallelization
Consider the addition of n elements for the two vectors p and x, where the sum is stored piece-wise in vector h. Listing 2 shows the sequential approach for this computation while Listing 3 shows the same computational task but the sum is executed in parallel. The for loop is replaced with hpx::parallel::for_loop. The first argument defines the execution policy while the second and third define the loop range. The last argument is the lambda function, executed in parallel for all i ranging from 0 to n. Note that only two lines of codes are required to execute the code in parallel. The parallel for loop can be combined with futurization for synchronization. Therefore, the parallel execution policy is changed to hpx::parallel::execution::par(hpx::parallel::execution::task).
The future can now be synchronized with other futures. Listing 4 shows an example for synchronization. Vectors p and x should be independently manipulated before the pairwise sum is computed. Therefore, the execution policy is changed and the futures of the manipulations are synchronized with the third future in line 27. Here, the hpx::wait_all ensures that the manipulations are finished and then describes the sum's dependencies.

Implementation of PeridynamicHPX
PeridynamicHPX is an modern C++ code utilizing HPX for parallelism. The classes design is modular and template-based to easily extend the code with new material models. The code's design is presented herein and parallelism and concurrency utilization within HPX is detailed. Figure 6 shows the modular design class. PeridynamicHPX contains three modules that are affected by the discretization extensions and material models. First, the Deck module handles the loading of the discretization and the configuration in the YAML Ain't Markup Language file format. Each new Deck inherits the common functions from the AbstractDeck and is extended with the new problem/material specific attributes.

Design with respect to modular expandability
Second, the abstractions for bond-based and state-based materials are provided in the Material module. The nonlinear bond-based elastic material in Section 3.2.1 and the linear state-based elastic material of Section 3.2.2 were implemented. The abstract material must be inherited and the abstract methods, e.g. force and strain, are implemented if a new material model is to be 1 std :: vector < hpx :: lcos :: future < void > > dep ; 2 3 dep . push_back ( hpx :: parallel :: for_loop ( 4 hpx :: parallel :: execution :: par ( 5 hpx :: parallel :: execution :: task ) , 6  implemented.
The different time integration schemes and the discretizations are considered third. All new Problem classes inherit their common and abstract functions from AbstractProblem. The schemes of Section 2.5.1 and Section 2.5.2 were implemented. Note that a problem implementation takes the abstract classes as arguments. The specific implementation can therefore be used for statebased and bond-based materials.
The design aims to hide most of the HPX specific features and making the code easy for adding new material models by implementing the abstract classes. For new problem classes, the user must to deal with the parallel for loops instead of using the C++ standard for loop. Thus, the code is accessible to users that do not have advanced programming experience in parallel computing, but still yields acceptable scalability without optimization.

Parallelization with HPX
The implementation of bond-based material and the explicit time integration is adapted from earlier one-dimensional code developed by the second author [14]. The implementation of the state-based material and the implicit time integration is adapted from [22]. These sequential algorithms are analyzed to make use of HPX parallelism. The use of HPX tools to achieve parallelism is discussed in the sequel.

Nonlinear elastic bond-based material
The internal force density and strain energy are computed for a peridynamic NP material, as described in Section 2.
Compute force at x i using (2.8) 7: Compute strain energy at x i using (2.12) 9: end for 11: end 12: hpx::wait all(f1) reference configuration is computed at the beginning. The force and energy at each mesh node are computed by adding the forces and energies from the neighboring nodes.

Linearly elastic state-based material
The internal force density and strain energy are computed for a state based elastic peridynamic solid material [30], as described in Section 2.3.2. Algorithm 2 shows the adapted algorithm [22] parallized and synchronized with HPX. First, the weighted volume m is computed for each node in parallel. Second, the dilation θ is computed for each node in parallel. The internal force density and the strain energy can be computed independently from each other. Therefore, the execution policy hpx::parallel::execution::task is utilized to obtain futures back of these two loops. Third, the internal force is computed asynchronously. Fourth, the strain energy is computed asynchronously, when needed, e.g. for output. A synchronization for these two futures is needed before the force and strain energy are computed in future steps. Figure 7 shows the implicit integration implementation flow chart. The external force b is updated for each load step s. Next, the residual (2.24) and its norm are computed and compared against the tolerance τ . If the residual is too large, the tangent stiffness matrix (2.23) is assembled, (see Algorithm 3). The displacement of the previous load step was used to assembly the first matrix K(u). Line 6 perturbs the displacement by ±υ, where υ is infinitesimally small. Line 9 computes the internal forces f ±υ and Line 13 evaluates the central difference to construct the stiffness matrix K(u). Note that the nodes neighborhood B δ is represented and has several zero entries where nodes do not have neighbors. Next, the guessed displacement is updated with the solution from solving Ku = −r. The residual is evaluated once and the Newton method is iterated until |r| ≤ τ .

Implicit time integration
The high-performance open-source C++ library Blaze [11,12] was used for matrix and vector operations. Blaze supports HPX for parallel execution and can be easily integrated. The library BlazeIterative 1 was used for solving Ku = −r. The Biconjugate gradient stabilized method (BiCGSTAB) or the conjugate gradient (CG) solver were used for solving.   Algorithm 3 Assembly of the tangent stiffness matrix by central finite difference. Adapted from [22] 1: K d·n×d·n = 0 Set matrix to zero 2: parallel for i < n do 3:

Explicit time integration
Evaluate force state under perturbations of displacement 5: for each displacement degree of freedom r at node j do 6: for each degree of freedom s at node k do 13: K sr + = f diffs /2υ  Figure 9: Sketch of the one dimensional bar benchmark test. The node on the left-hand side is clamped and its displacement is set to zero. A force F is applied on the node at the right-hand side.
forces and energies corresponding to displacement u k . The velocity-verlet algorithm is used to compute the velocity at k + 1/2 and displacement u k+1 . Line 12 invokes the Material class again to compute the forces at new displacements u k+1 . The velocity at k + 1 is computed with the updated forces.

Validation of PeridynamicHPX
In this section we demonstrate the convergence of HPX implementations for both implicit and explicit schemes.

One dimensional tensile test
Consider the simple geometry of Figure 9 for comparing the one dimensional implicit time integration against a classical continuum mechanics (CCM) solution. The node on the left-hand side is clamped and displacement is set to zero. A force F is applied to the node at the right-hand side.
The strain, stress, and strain energy for this configuration are compared with the values obtained from classical continuum mechanics (CCM) where σ = E ·ε, where σ is the stress, E is the Young's modulus and ε is the strain. The stress σ = F /S, is the defined by the force F per cross section S. Thus, the strain is obtained by ε = σ /E = F /(SE). For a force F of 40 N, a cross section of 1 mm 2 and a Young's modulus of 4 GPa, the resulting strain reads ε CCM = 0.01 an the stress is of Loop over time steps 2: for 0 ≤ k < N do 3: Compute peridynamic force f k , body force b k , external force f k ext , and total energy U k total 4: 5: parallel for i < n do 6: Update boundary condition for time t = (k + 1)∆t

Compute velocity
12:  Check if ∆t corresponds to stable time step 19: if U k+1 total > U k total then 20: Output message The bar was discretized with 33 nodes with a nodal spacing h of 500 mm. The horizon was δ = 1000 mm. The tolerance τ for the Biconjugate gradient stabilized method (BiCGSTAB) was 1 × 10 −9 . Figure 10 shows that stresses, strains and strain energy match perfectly the theoretical values inside the bar but all these quantities diverge at the boundaries. These effects are the well-known surface effects within the EMU nodal discretization. Figure 11 shows the two-dimensional tensile benchmark. The line of nodes of the right-hand side of the plate are clamped in x-direction and y-direction. On each of the nodes of the line at the left-hand side a force of force F = −50 kN in x-direction was applied. The displacement of a node x i for a tensile behavior can be determined with CCM as follows

Two dimensional tensile test
where F is applied and W and H are respectively the plate's width and height.
H and W were set to 375 mm and h = 25 mm. The tolerance for the BiCGSTAB solver was τ = 1 × 10 −3 . The m-value was, 4, which means that 2m + 1 nodes are within [x i − δ, x i + δ]. Table 1 lists the actual position at the node in the center of the plate x mid and its comparison with the one from CCM (4.1). The relative error for the actual position in x-direction is sufficiently small. With the applied boundary conditions the displacement at the point in the center of the plate in y-direction is zero.

Explicit
The theoretical convergence presented in Section 2.6 is now compared to actual HPX implementations.

One dimensional
The linearized bond-based peridynamic force is given by Here the non-linear force model defined by the potential function ψ is linearized by retaining the first term in its taylor series about zero. For convenience, we refer to, the peridynamics solution with, a linearized force as LP and that with a nonlinear force as NP. Initial condition 2(IC 2): The initial condition u 0 and v 0 are described as  The bump in the plot is due to wave reflecting at the boundary. For rapidly varying (spatially) displacement field, the length scale at which displacement varies is small, and to capture it more accurately the mesh size should be smaller than the length scale associated to the displacement field. We, thus, see from the plot of rate that near wave reflection time the rate for finer mesh is closer to the theoretical rate of 1. Similar results, not shown here, were obtained for the nonlinear model of equation (2.1). The convergence results presented in Figure 12 agree with the theoretical convergence rate, which suggests that the implementation is robust.

Two dimensional
Let Ω = [0, 1] 2 be the material domain with a horizon δ = 0.1. The 2-d vector X is written X = (X 1 , X 2 ) where X 1 and X 2 are the components along the x and y axes. The time domain is taken to be [0, 2] and ∆t = 10 −5 is the time step. The influence function is J δ (r) = 1 for r < δ and 0 otherwise. The rateᾱ is computed for three mesh sizes h = δ/2, δ/4, δ/8. Both Dirichlet boundary conditions as well as mixed of Dirichlet and Neumann boundary conditions are used in the convergence analysis.
1. Dirichlet formulation: Let the initial condition on displacement vector u 0 = (u 0,1 , u 0,2 ) and velocity vector v 0 = (v 0,1 , v 0,2 ) be  The boundary conditions given in this example are a mix of Dirichlet and Neumann boundary conditions. Figure 15 showsᾱ with respect to time for the nonlinear (NP) peridynamics solution. The rate of 1 was proved theoretically for the Dirichlet type formulation, see [13,14], and the results in Figure 14 and Figure 15 validate the theoretical study.

Implicit
The benchmark is realized by comparing the the computational time to the theoretical complexity. The theoretical complexity is the number of operations an algorithm requires to perform its task. The computational time is the measured physical time the program required to complete its tasks.
The algorithm in Figure 7 features several loops for which the maximal amount of iterations can be estimated with O(n 2 ), with n being the number of discrete nodes. The computational complexity of the conjugate gradient (CG) solver for k iterations can be estimated with O(kn 2 ). The complexity for the computational time can be approximated with O( n 2 +km p ), where p is the amount of CPUs.
The test case of Section 4.1.2 served as benchmark for the two dimensional implicit time integration. Figure 17 Figure 16a shows the obtained computational time for up to 6 CPUs. The figure shows that the computational time and theoretical complexity followed similar trends. Note that only parallel for loops, synchronization, and futurization were utilized for parallelization. Figure 16b shows the CPUs idle-rates. The idle-rate is obtained with the performance counters within HPX and measured, in percentage, how long the CPU was idling with respect to the overall computation time. Here, the idle-rate is 0.01%, for the default execution policy. These observations suggest that the implicit time integration seems to scale with the same trend as the theoretical complexity without any code optimization. More sophisticated execution policies (e.g. dynamic or adaptive chunk size) could be applied, to decrease the computational time.

Explicit
The setup presented in Section 4.2 was discretized with 196249 nodes and an horizon of 0.05 m and m = 20 as chosen. The benchmark was run on CentOS 7 with kernel 3.10.0 on Intel(R) Xeon(R) CPU E5-2690 @ 2.90GHz. HPX (82f 7b281) and PeridynamicHPX were compiled with gcc 7.2, boost 1.61 and blaze 3.2 libraries. Figure 18a shows the measured computational time for up to 8 CPUs. The algorithm in Figure 8 combines several loops over twice the amount of nodes n, which can be estimated by O(n 2 ) and for parallel for loops by O( n 2 p ). The computational time shows the same behavior as the theoretical complexity. Figure 18b shows the idle rate. The idle rate is independent of the amount of CPUs and work is well distributed with the default execution policy.

Conclusion
Bond-based and state-based elastic peridynamic material models and implicit and explicit time integration schemes were implemented within a asynchronous many task run time system. These run time systems, like HPX, are essential for utilizing the full performance of the cluster with many cores on a single node.
One important part of the design was the modular expandability for the extensions. New material models can be integrated into the code by inheriting the functions of an abstract class. Consequently, only the material specific functionality, like forces or strain, is provided by the user and implementation details for the parallelism and concurrency are hidden as much as passable by the user. Additional HPX-related functionality needs to be considered for the extension to other integration schemes.
Materials models and the different time integration schemes were validated against theoretical solutions and classical continuum mechanics. All are in good agreement with reference solutions. The convergence rate was shown to be closer to theoretical value, which suggests that the code behaves as expected. The solutions converge to the exact solution at a rate.
The code scaling with respect to computational resource is important and our benchmark results show that the scaling achieved is very close to the theoretical estimates. Both integration schemes were compared against theoretical estimations. The trend of the theoretical estimates fits with measured computational time and both integration schemes scale with increasing amounts