hessQuik : Fast Hessian computation of composite functions

hessQuik is a lightweight software library for fast computation of second-order derivatives (Hessians) of composite functions (that is, functions formed via compositions) with respect to their inputs. The core of hessQuik is the efficient computation of analytical Hessians with GPU acceleration. hessQuik is a PyTorch (Paszke 2019) package that is user-friendly and easily extendable. The repository includes a variety of popular functions and layers, including residual layers and input convex layers, from which users can build complex models through composition. hessQuik layers are designed for ease of composition - users only need to select the layers and the package provides a convenient wrapper to compose the functions properly. Each layer provides two modes for derivative computation and the mode is automatically selected to maximize computational efficiency. hessQuik includes easy-access, illustrative tutorials on Google Colaboratory (Bisong, 2019), reproducible experiments, and unit tests to verify implementations. hessQuik enables users to obtain valuable second-order information for their models simply and efficiently.

hessQuik Building Blocks hessQuik builds complex functions constructed through composition of simpler functions, which we call layers. The package uses the chain rule to compute Hessians of composite functions, assuming the derivatives of the layers are implemented analytically. We describe the process mathematically.
Let f : R n0 → R n be a twice continuously-differentiable function defined as f = g • g −1 • · · · • g 1 , where g i : R ni−1 → R ni for i = 1, . . . , . ( Here, g i represents the i-th layer and n i is the number of hidden features on the i-th layer. We call n 0 the number of input features and n the number of output features. We note that each layer can be parameterized by weights which we can tune by solving an optimization problem. Because hessQuik computes derivatives for the network inputs, we omit the weights from our notation.

Implemented hessQuik Layers
hessQuik includes a variety of popular layers and their derivatives. These layers can be composed to form complex models. Each layer incorporates a non-linear activation function, σ : R → R, that is applied entry-wise. The hessQuik package provides several activation functions, including Sigmoid, Hyperbolic Tangent, and Softplus. Currently supported layers include the following: • singleLayer: This layer consists of an affine transformation followed by pointwise non-linearity where K and b are a weight matrix and bias vector, respectively, that can be tuned through optimization methods. Multilayer perceptron neural networks are built upon these layers.
• residualLayer: This layer differs from a single layer by including a skip connection where h > 0 is a step size. Residual layers are the building blocks of residual neural networks (ResNets) (He et al., 2016). ResNets can be interpreted as discretizations of differential equations or dynamical systems Haber & Ruthotto, 2017).
• ICNNLayer: The input convex neural network layer preserves convexity of the composite function with respect to the input features. Our layer follows the construction of (Amos et al., 2017).
• quadraticLayer, quadraticICNNLayer: These are layers that output scalar values and are typically reserved for the final layer of a model.
The variety of implemented layers and activation functions makes the task of designing a wide range of hessQuik models easy.

Computing Derivatives with hessQuik
In hessQuik, we offer two modes, forward and backward, to compute the gradient ∇ u0 f and the Hessian ∇ 2 u0 f of the function with respect to the input features. The cost of computing derivatives in each mode differs and depends on the number of input and output features. hessQuik automatically selects the least costly method by which to compute derivatives. We briefly describe the derivative calculations using the two methods.
First, it is useful to express the evaluation of f as an iterative process. Let u 0 ∈ R n0 be a vector of input features. Then, the function evaluated at u 0 is where u i are the hidden features on layer i for i = 1, . . . , − 1 and u are the output features.

Forward Mode
Computing derivatives in forward mode means building the gradient and Hessian during forward propagation; that is, when we form u i , we simultaneously form the corresponding gradient and Hessian information. We start by computing the gradient and Hessian of the first layer with respect to the inputs; that is, We compute the derivatives of subsequent layers using the following mappings for i = 1, . . . , −1 where × k is the mode-k product (Kolda & Bader, 2009) and ∇ u0 u ≡ ∇ u0 f (u 0 ) is the Hessian we want to compute. The Hessian mapping in Equation 10 is illustrated in Figure 1. For efficiency, we store ∇ ui g i+1 (u i ) when we compute the gradient and re-use this matrix to form the Hessian. Notice that the sizes of the derivatives always depend on the number of input features, n 0 . Figure 1: Illustration of Hessian computation of ∇ 2 u 0 ui+1 in forward mode. Note that for the first term, the gray three-dimensional array ∇u i gi+1(ui) is treated as a stack of matrices. Then, the same Jacobian matrix ∇u 0 ui is broadcast to each matrix in the stack, illustrated by the repeated cyan matrices. In the second term, the green matrix ∇u i gi+1(ui) is applied along the third dimension of the magenta three-dimensional array, ∇u 0 ui. Both of these operations can be parallelized and accelerated GPUs.

Backward Mode
Computing derivatives in backward mode is also known as backward propagation and is the method by which automatic differentiation computes derivatives. The process works as follows: We first forward propagate through the network without computing gradients or Hessians. After we forward propagate, we build the gradient and Hessian starting from the output layer and working backwards to the input layer. We start by computing derivatives of the final layer with respect to the previous features; that is, We compute derivatives of previous layers using the following mappings for i = − 1, . . . , 1: For efficiency, we re-use ∇ ui−1 g i (u i−1 ) from the gradient computation to compute the Hessian. Notice that the sizes of the derivatives always depend on the number of output features, n .

Forward Mode vs. Backward Mode
The computational efficiency of computing derivatives is proportional to the number of input features n 0 and the number of output features n . The heuristic we use is if n 0 < n , we compute derivatives in forward mode, otherwise we compute derivatives in backward mode. Our implementation automatically selects the mode of derivative computation based on this heuristic. Users have the option to select their preferred mode of derivative computation if desired.

Testing Derivative Implementations
The hessQuik package includes methods to test derivative implementations and corresponding unit tests. The main test employs Taylor approximations; for details, see (Haber, 2014).

Efficiency of hessQuik
We compare the time to compute the Hessian of a neural network with respect to the input features of three approaches: hessQuik (our AD-free method), PytorchAD which uses automatic differentiation following the implementation in (Huang et al., 2021), and PytorchHessian which uses the built-in Pytorch Hessian function.
We compare the time to compute the gradient and Hessian of a network with an input dimension d = 2 k where k = 0, 1, . . . , 10. We implement a residual neural network (He et al., 2016) with the width is w = 16, the depth is d = 4, and various numbers of output features, n . For simplicity, the same network architecture is used for every timing test.
For reproducibility, we compare the time to compute the Hessian using Google Colaboratory (Colab) Pro and provide the notebook in the repository. For CPU runtimes, Colab Pro uses an Intel(R) Xeon(R) CPU with 2.20GHz processor base speed. For GPU runtimes, Colab Pro uses a Tesla P100 with 16 GB of memory. We note that Colab allocates resources based on availability, and hence exact quantitative reproducibility is not guaranteed. However, we expect users to get qualitatively similar results when running on their own Colab instance or locally.
In Figure 2 and Figure 3, we compare the performance of three approaches to compute Hessians of a neural network. In our experiments, we see faster Hessian computations using hessQuik and noticeable acceleration on the GPU, especially for networks with larger input and output dimensions. Specifically, Figure 2 shows that for a model with a scalar output, the timing using the hessQuik implementation scales better with the number of input features than either of the AD-based methods. Additionally, Figure 3 demonstrates that the hessQuik timings remain relatively constant as the number of output features changes whereas the PytorchAD timings significantly increase as the number of output features increases. Note that we only compare to PytorchAD for vector-valued outputs because PytorchHessian was noticeably slower for the scalar case.

Conclusions
hessQuik is a simple, user-friendly software library for computing second-order derivatives of composite functions with respect to their inputs. This PyTorch package includes many popular built-in layers, tutorial repositories, reproducible experiments, and unit testing for ease of use. The implementation scales well in time with various input and output feature dimensions and performance is accelerated on GPUs, notably faster than automatic-differentiation-based second-order derivative computations. We hope the accessibility and efficiency of this package will encourage researchers to use and contribute to hessQuik in the future.