thresholdmodeling: A Python package for modeling excesses over a threshold using the Peak-Over-Threshold Method and the Generalized Pareto Distribution

Summary
Extreme value analysis has emerged as one of the most important disciplines in the applied sciences when dealing with limited datasets and when the main goal is to extrapolate beyond the observations over a given time. By using a threshold model with an asymptotic characterization, it is possible to work with the Generalized Pareto Distribution (GPD) (Coles, 2001) and use it to model the stochastic behavior of a process at an unusual level, either a maximum or a minimum. For example, consider a large dataset of wind velocity in Florida, USA, over a certain period of time. It is possible to model this process and to quantify the probability of extreme events, such as hurricanes, which are maximum observations of wind velocity, in a time of interest using the return value tool.
In this context, this package provides a complete toolkit for conducting a threshold model analysis, from the initial phase of selecting the threshold through model fitting, model checking, and return value analysis. Functions for statistical moments are also provided. For extremes of dependent sequences, it is also possible to conduct a declustering analysis.
In a software context, there is a strong community working with R packages such as POT (Ribatet & Dutang, 2019), evd (Stephenson, 2018), and extRemes (Gilleland, 2019), which are used for complete extreme value modeling. In Python, there is scikit-extremes (Correoso, 2019), which does not contain threshold models yet. Another package is scipy, which has the genpareto (Scipy, 2019) functions, but it does not provide any Peak-Over-Threshold modeling functions, since it is not possible to define a threshold using this package.
Moreover, this package gives scientists, engineers, programmers, and any other interested person the possibility of conducting an extreme value analysis in a strong, consolidated, high-level programming language. This matters given the importance of the extreme value theory approach for statistical analysis in corrosion engineering (see Scarf & Laycock (1994) and Tan (2017)), hydrology (see Katz, Parlange, & Naveau (2002)), environmental data analysis (see Rydman (2018) and Bommier (2014)), and many other fields of natural sciences and engineering. (For a large number of additional applications, see Coles (2001).)
Hence, the thresholdmodeling package presents numerous functions to model the stochastic behavior of an extreme process. For a complete introduction to all fifteen package functions, see the Functions Documentation on the GitHub page.

Package Features

Threshold Selection
• Mean Residual Life Plot: It is possible to plot the Mean Residual Life function as defined in Coles (2001);
• Parameter Stability Plot: It is also possible to obtain the two parameter stability plots of the GPD: the Shape Parameter Stability Plot and the Modified Scale Parameter Stability Plot, where the modified scale is defined from a reparametrization of the GPD scale parameter.
(See Coles (2001) for a complete theoretical introduction to these two plots.)
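
As a rough illustration of what a mean residual life plot displays, the sketch below computes the mean excess over a grid of candidate thresholds using only NumPy. The function name `mean_residual_life` and the synthetic exponential sample are assumptions made for this example; this is not the package's own implementation.

```python
import numpy as np

def mean_residual_life(data, thresholds):
    """Mean excess over each threshold, plus the number of exceedances.

    Hypothetical helper for illustration; not part of thresholdmodeling.
    """
    data = np.asarray(data, dtype=float)
    means, counts = [], []
    for u in thresholds:
        excesses = data[data > u] - u          # amounts by which data exceed u
        counts.append(excesses.size)
        means.append(excesses.mean() if excesses.size else np.nan)
    return np.array(means), np.array(counts)

# Synthetic sample (an assumption): exponential data with scale 2.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=5000)
thresholds = np.linspace(0.0, 4.0, 9)
mean_excess, n_exceed = mean_residual_life(sample, thresholds)
```

Plotting `mean_excess` against `thresholds` gives the curve; under a GPD model the curve should be approximately linear above a suitable threshold, which is what guides the choice.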

Model Fit
• Fit the GPD Model: Fits a GPD model to a given dataset using one of several fit methods (see Model Fit).
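
To make the fitting step concrete, here is a minimal sketch that fits a GPD to threshold excesses with `scipy.stats.genpareto`, whose shape parameter `c` plays the role of the GPD shape. The helper `fit_gpd_excesses`, the synthetic data, and the threshold value are assumptions for illustration; SciPy's `fit` uses maximum likelihood, and the package's own fit methods may differ.

```python
import numpy as np
from scipy.stats import genpareto

def fit_gpd_excesses(data, threshold):
    """Fit a GPD to the excesses over `threshold` by maximum likelihood.

    Hypothetical helper for illustration; not part of thresholdmodeling.
    """
    data = np.asarray(data, dtype=float)
    excesses = data[data > threshold] - threshold
    # Excesses start at zero, so the location is fixed with floc=0.
    shape, _, scale = genpareto.fit(excesses, floc=0)
    return shape, scale, excesses.size

# Synthetic GPD data (an assumption): shape 0.2, scale 5.
rng = np.random.default_rng(1)
synthetic = genpareto.rvs(c=0.2, scale=5.0, size=20000, random_state=rng)
shape_hat, scale_hat, n_exc = fit_gpd_excesses(synthetic, threshold=10.0)
```

Because excesses of a GPD over a higher threshold are again GPD with the same shape and scale shifted by shape times threshold, the fitted scale here should sit near 5 + 0.2 × 10 = 7 rather than 5.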

Model Checking
• Probability Density Function, Cumulative Distribution Function, Quantile-Quantile and Probability-Probability Plots: Plots the theoretical probability density function together with the normalized empirical histogram for a given dataset, using one of several bin methods (see gpdpdf). The theoretical CDF can also be drawn against the empirical one, with Dvoretzky-Kiefer-Wolfowitz confidence bands. In addition, the QQ and PP plots, which compare the sample and theoretical values, can be obtained; the first uses the Kolmogorov-Smirnov Two Sample Test to obtain the confidence bands, while the second uses the Dvoretzky-Kiefer-Wolfowitz method;
• L-Moments Plots: L-Skewness against L-Kurtosis plot for given threshold values using the Generalized Pareto parametrization. Be warned that L-Moments plots are really difficult to interpret. See Ribatet & Dutang (2019) and Hosking & Wallis (1997) for more details.
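
The Dvoretzky-Kiefer-Wolfowitz bands mentioned above follow from a simple inequality: with probability at least 1 − α, the empirical CDF stays within ε = sqrt(ln(2/α) / (2n)) of the true CDF everywhere. The sketch below computes such a band with NumPy; the function name `dkw_band` and the sample are assumptions for this example, and the package may construct its bands differently.

```python
import numpy as np

def dkw_band(sample, alpha=0.05):
    """Empirical CDF with a (1 - alpha) DKW confidence band.

    Hypothetical helper for illustration; not part of thresholdmodeling.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    ecdf = np.arange(1, n + 1) / n                   # ECDF at each sorted point
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))   # DKW half-width
    lower = np.clip(ecdf - eps, 0.0, 1.0)            # keep within [0, 1]
    upper = np.clip(ecdf + eps, 0.0, 1.0)
    return x, ecdf, lower, upper

# Synthetic sample (an assumption) of 100 exponential observations.
rng = np.random.default_rng(2)
x, ecdf, lower, upper = dkw_band(rng.exponential(size=100), alpha=0.05)
```

A fitted model passes this visual check when its theoretical CDF lies between `lower` and `upper` over the range of the data.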

Model Diagnostics and Return Level Analysis
• Return Level Computation and Plot: It is also possible to compute the return value for a given return period, with a confidence interval obtained by the Delta Method (Coles, 2001). Furthermore, a return level plot is provided, with confidence bands also obtained by the Delta Method. For comparison, the empirical return level plot is provided as well.
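
For orientation, the point estimate behind a return level can be sketched with the standard GPD formula from Coles (2001): x_m = u + (σ/ξ)[(m ζ_u)^ξ − 1], where ζ_u is the probability of exceeding the threshold u and m is the number of observations in the return period. The function `gpd_return_level` is a hypothetical helper, and the input values are rounded estimates for the rainfall data as recalled from Coles (2001); treat them as illustrative assumptions. The Delta Method interval is not reproduced here.

```python
import numpy as np

def gpd_return_level(threshold, scale, shape, exceed_rate, obs_per_year, years):
    """Return level exceeded on average once every `years` years.

    Hypothetical helper for illustration; not part of thresholdmodeling.
    """
    m = years * obs_per_year                 # observations in the return period
    if abs(shape) < 1e-12:                   # limiting case as shape tends to 0
        return threshold + scale * np.log(m * exceed_rate)
    return threshold + (scale / shape) * ((m * exceed_rate) ** shape - 1.0)

# Illustrative inputs (assumptions, rounded): threshold 30, scale 7.44,
# shape 0.184, and 152 exceedances out of 17531 daily observations.
level_100 = gpd_return_level(
    threshold=30.0, scale=7.44, shape=0.184,
    exceed_rate=152 / 17531, obs_per_year=365, years=100,
)
```

With these rounded inputs, `level_100` comes out close to 106, in line with the 100-year return value that the package reports for this dataset.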

Declustering and Data Visualization
It is possible to visualize the data over the unit of a return period. For dependent extreme sequences, the dataset can be clustered according to a given empirical rule (a number of days, for example); taking the maximum observation of each cluster then yields a declustered series of maxima.
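
One simple reading of this rule is to split the series into consecutive blocks of a fixed number of observations and keep the maximum of each block that exceeds the threshold. The sketch below follows that reading; the function name `decluster_block_maxima`, the toy series, and the block interpretation itself are assumptions, and the package's declustering may work differently.

```python
import numpy as np

def decluster_block_maxima(data, threshold, block_size):
    """Keep the maximum of each fixed-size block that exceeds `threshold`.

    Hypothetical helper for illustration; not part of thresholdmodeling.
    """
    data = np.asarray(data, dtype=float)
    maxima = []
    for start in range(0, data.size, block_size):
        block = data[start:start + block_size]   # last block may be shorter
        peak = block.max()
        if peak > threshold:                     # drop blocks with no exceedance
            maxima.append(peak)
    return np.array(maxima)

# Toy series (an assumption): three blocks of four observations each.
series = [1, 9, 8, 2, 1, 1, 3, 7, 2, 1, 10, 11]
cluster_maxima = decluster_block_maxima(series, threshold=5, block_size=4)
```

Neighbouring exceedances such as 9 and 8, or 10 and 11, collapse into a single cluster maximum, which reduces the dependence between the values passed to the GPD fit.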

Further Functions
It is also possible to compute sample L-Moments, model L-Moments, non-central moments, and differential entropy, and to plot the survival function.
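
As an illustration of what sample L-Moments are, the sketch below computes the first two L-Moments and the L-Skewness and L-Kurtosis ratios from unbiased probability weighted moments, following the standard definitions in Hosking & Wallis (1997). The function name `sample_l_moments` is an assumption for this example and is not taken from the package; it needs at least four observations.

```python
import numpy as np

def sample_l_moments(data):
    """Return (l1, l2, L-skewness, L-kurtosis) of a sample with n >= 4.

    Hypothetical helper for illustration; not part of thresholdmodeling.
    """
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)                      # ranks of the sorted sample
    # Unbiased probability weighted moments b0..b3.
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((i - 1) * (i - 2) * (i - 3)
                / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    # L-moments as linear combinations of the PWMs.
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3 / l2, l4 / l2

l1, l2, l_skew, l_kurt = sample_l_moments([1, 2, 3, 4, 5])
```

For this evenly spaced, symmetric toy sample the L-Skewness is zero, as expected for symmetric data.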

Installation
For installation instructions, see the README on the GitHub page.

Reproducibility and User's Guide
The repository on the GitHub page contains a link to the dataset Daily Rainfall in the South-West of England from 1914 to 1962. It can be used to test the software, verify its results, and compare them with the expected ones in Coles (2001). For a more detailed tutorial on the use of each function, go to the Test directory.
A minimal example of how to use the software reproduces some of the results presented by Coles (2001). Also, for the given return period (100 years), the software presents the following result in the terminal:
The return value for the given return period is 106.3439 ± 40.8669
For more details, see the documentation on the GitHub page, which is up to date.