TimeseriesSurrogates.jl: a Julia package for generating surrogate data

The method of surrogate data is a way to generate data that preserve one or more statistical or dynamical properties of a given timeseries, but are otherwise randomized. Surrogate timeseries methods are widely used for null hypothesis testing in nonlinear dynamics and in causal inference, and more generally for producing synthetic data with statistical properties similar to those of an original signal. The method was originally introduced by Theiler et al. (1992) to test for nonlinearity in timeseries; numerous surrogate methods aimed at preserving different properties of the original signal have since emerged (for a review, see Lancaster et al., 2018).


Introduction
A simple application of surrogates is to determine whether a given timeseries x can be represented by a linear noise process. If it cannot, this may indicate that the timeseries represents deterministic nonlinear dynamics with additional noise. A simple way to test this hypothesis is to generate new timeseries from x that conserve the power spectrum of x (which is a defining feature of linear stochastic processes). Then, a discriminatory statistic, such as the correlation dimension or the auto-mutual information (Lancaster et al., 2018), is computed for x and also for thousands of surrogates of x. The values of the statistic for the surrogates form a distribution of possible values under the null hypothesis. If the value for x lies well within the spread of this distribution, then the null hypothesis (here, that x can be approximated as a linear stochastic process) cannot be rejected.
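The test described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the procedure of any cited work: it uses the package's RandomFourier method with the surrogenerator interface, and it substitutes a simple time-reversal asymmetry statistic (the hypothetical helper `asymmetry`) for the correlation dimension or auto-mutual information. The test signal (`logistic_series`, also hypothetical), the seed, and the number of surrogates are arbitrary choices.

```julia
using TimeseriesSurrogates, Random, Statistics

rng = MersenneTwister(1234)   # fixed seed, arbitrary choice

# Illustrative test signal: a chaotic logistic map with weak observational noise.
function logistic_series(n, rng; r = 4.0, x0 = 0.4, noise = 0.01)
    x = zeros(n)
    x[1] = x0
    for i in 2:n
        x[i] = r * x[i-1] * (1 - x[i-1])
    end
    return x .+ noise .* randn(rng, n)
end

# Hypothetical discriminating statistic: time-reversal asymmetry at lag τ.
asymmetry(y; τ = 1) = mean((y[1+τ:end] .- y[1:end-τ]) .^ 3)

x = logistic_series(1000, rng)

# Surrogates that randomize Fourier phases and keep the power spectrum of x.
sg = surrogenerator(x, RandomFourier(), rng)

nsurr = 1000
q_x = asymmetry(x)
q_s = [asymmetry(sg()) for _ in 1:nsurr]

# Two-sided, rank-based p-value: the fraction of surrogates whose statistic
# is at least as far from the surrogate median as the statistic of x.
center = median(q_s)
p = (1 + count(q -> abs(q - center) >= abs(q_x - center), q_s)) / (nsurr + 1)

println("statistic for x: ", q_x)
println("p-value under the linear-noise null: ", p)
```

A small p-value suggests rejecting the linear-noise null for x. A large one only means that this particular statistic does not distinguish x from its surrogates.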

Statement of need
Surrogate data have been used in several thousand publications so far (Theiler et al. (1992) has been cited more than 4,000 times), and hence the community is in clear need of such methods. Existing software packages for surrogate generation provide far fewer methods than are available in the literature, with less-than-optimal performance (see the Comparison section below), and without allowing reproducible generation of surrogates. TimeseriesSurrogates.jl provides more than double the number of methods offered by other packages, with runtimes ranging from similar to an order of magnitude faster than those of existing surrogate packages in other languages. Equally importantly, TimeseriesSurrogates.jl provides a framework that is tested via continuous integration and is easy to extend via open source contributions.

Method | Description | Reference
--- | --- | ---
AutoRegressive | Autoregressive model based surrogates. |
RandomShuffle | Random shuffling of individual data points. | Theiler et al. (1992)
BlockShuffle | Random shuffling of blocks of data points. | Theiler et al. (1992)
CircShift | Circularly shift the signal. |
RandomFourier | Randomization of phases of Fourier transform of the signal. | Theiler et al. (1992)
PartialRandomization | Fourier randomization, but tuning of the "degree" of randomization. | This paper.
CycleShuffle | Random shuffling of the cycles of the signal. |
ShuffleDimensions | Random shuffling of the dimensions of a multivariate timeseries. | This paper.
... | ... | Miralles et al. (2015)
LS | Lomb-Scargle periodogram based surrogates for irregular time grids. | Schmitz & Schreiber (1999)

Documentation strings for the various methods describe the usage intended by the original authors of the methods. Example applications are showcased in the package documentation.

Design of TimeseriesSurrogates.jl
TimeseriesSurrogates.jl has been designed to be as performant as possible and as simple to extend as possible.
At a first level, we offer the function surrogate, which generates a single surrogate from an input timeseries x and a given method. The function surrogate is straightforward to use, but it does not allow maximum performance. The reason is that when making a second surrogate from the same x and the same method, there are many structures and computations that could be pre-initialized and/or reused for all surrogates. This is especially relevant for real-world applications, where one typically makes thousands of surrogates with a given method.

To address this, we provide a second level of interface, the surrogenerator function. It works as follows: first, the user initializes a "surrogate generator" structure:

    method = RandomShuffle()
    sg = surrogenerator(x, method, rng)

The structure sg can generate surrogates of x on demand in the most performant manner possible for the given inputs x and method. It can be used like so:

    for i in 1:100
        s = sg() # generate a surrogate
        # code...
    end
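As a hedged illustration of the two interface levels and of reproducible generation, the sketch below draws surrogates with both surrogate and surrogenerator, passing an explicitly seeded random number generator. The input signal and the seeds are arbitrary, and the three-argument form `surrogate(x, method, rng)` is assumed to mirror the surrogenerator call shown above. The `copy` calls are a precaution in case a generator reuses an internal output buffer between calls, which may depend on the package version.

```julia
using TimeseriesSurrogates, Random

x = cumsum(randn(MersenneTwister(1), 500))   # illustrative input: a random walk
method = RandomShuffle()

# First level: one call per surrogate; setup work is repeated on every call.
s_single = surrogate(x, method, MersenneTwister(42))

# Second level: initialize once, then draw as many surrogates as needed.
sg = surrogenerator(x, method, MersenneTwister(42))
# copy() guards against the generator reusing its output array (an assumption).
ensemble = [copy(sg()) for _ in 1:100]

# A generator seeded identically yields the same sequence of surrogates.
sg_again = surrogenerator(x, method, MersenneTwister(42))
ensemble_again = [copy(sg_again()) for _ in 1:100]

println("identical ensembles: ", ensemble == ensemble_again)
println("values of x preserved: ", sort(ensemble[1]) == sort(x))
```

Passing the random number generator explicitly, rather than relying on global state, is what makes an ensemble of surrogates repeatable across runs.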

Comparison
The average time to generate surrogates in TimeseriesSurrogates.jl is in the best case about an order of magnitude shorter than, and in the worst case roughly equal to, that of the MATLAB surrogate code provided by Lancaster et al. (2018), though comparisons are not exact due to differing implementations and tuning options. Moreover, the code of Lancaster et al. (2018) is not an actual package, but rather a collection of scripts that have been written and circulated. As such, it lacks a test suite run via continuous integration. Timings for commonly used surrogate methods that are available in both libraries are shown in Figure 1. Additionally, because TimeseriesSurrogates.jl provides many methods not implemented in other packages, a comprehensive comparison of runtimes is not possible, but due to our optimized surrogate generators, we expect good performance relative to future implementations in other languages.

Figure 1: Timings for Fourier transform (ft), amplitude-adjusted Fourier transform (aaft), iterated AAFT (iaaft) and pseudoperiodic (pps) surrogates, using pre-initialized generators with default parameters and a maximum of 100 iterations for the IAAFT algorithm. MATLAB timings are generated using the code provided by Lancaster et al. (2018). Note: timings for the pseudoperiodic surrogates in MATLAB include the search for the embedding lag and dimension, which is part of the preprocessing step in the Julia version. Scripts to reproduce the Julia and MATLAB timings are available in the GitHub repository for this paper.
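For readers who want a rough sense of how per-surrogate timings with pre-initialized generators could be measured, a minimal sketch follows. It is not the benchmark script from the repository: the input signal, the seed, and the number of repetitions are arbitrary, and the methods are assumed to be constructible with their default parameters as `RandomFourier()`, `AAFT()` and `IAAFT()`. Pseudoperiodic surrogates are left out because they require embedding parameters.

```julia
using TimeseriesSurrogates, Random

rng = MersenneTwister(2022)
x = cumsum(randn(rng, 1000))   # illustrative input, not the benchmark data

# Methods constructed with default parameters (assumed constructors).
candidates = Dict("ft" => RandomFourier(), "aaft" => AAFT(), "iaaft" => IAAFT())

nsurr = 100
timings = Dict{String, Float64}()
for (label, method) in candidates
    sg = surrogenerator(x, method, rng)   # initialization is excluded from the timing
    sg()                                  # warm-up call so compilation is not measured
    timings[label] = @elapsed for _ in 1:nsurr
        sg()
    end
end

for label in sort(collect(keys(timings)))
    ms = round(1e3 * timings[label] / nsurr; digits = 3)
    println(rpad(label, 6), ms, " ms per surrogate")
end
```

For publication-quality numbers, a dedicated benchmarking tool with repeated samples would be preferable to a single `@elapsed` measurement.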