Predicting the future from historical sequential data, i.e. time-series prediction, is in ever-greater demand, owing to ever-increasing data collection, unprecedented computational power, and the popularity of machine learning and data science.

With a steady stream of new practitioners and easily accessible resources, software, and libraries, it's easy for anyone to pick up a machine learning model and start predicting something. In contrast, faithfully assessing predictability, explaining observed dynamics, and selecting the most appropriate model are not as straightforward, yet they are indispensable in real-world applications. It was with this in mind that this post was written.

Important Questions

These days, newcomers to time-series prediction might immediately opt for a Neural Network (NN) or a tree-ensemble Gradient Boosting Machine (GBM) such as the XGBoost or LightGBM regressors, which are certainly highly capable model types (depending on the application and, for NNs, the architecture). And that's fine. However, some questions soon arise, such as:

Q1. Is the obtained performance good?
Q2. Does the model predict anything nontrivial?
Q3. How does the model explain the data's behavior (dynamics)?

This post discusses a basic methodology, or way of thinking, to answer the above questions. It should by no means be taken as a complete guideline or a definitive answer to these questions, but rather as one example of how one could go about answering them.

Answers?

All of the above questions are highly related. To explain this, let's assume that the data (or state, or features) ${\bf x}$ evolves, from time $k$ to $k+T$, as follows:

$${\bf x }_{k+T} = {\bf f}({\bf x}_k) + {\bf g}_k(\cdot)$$

${\bf x}$ is an $m$-dimensional vector of the measured features, ${\bf x} = [{\rm feat}_1, \ldots, {\rm feat}_m]$ (the so-called state at time $k$ in dynamical systems theory). The above equation is the sum of two functions:

  1. A function ${\bf f}$ of the state at time $k$, and
  2. a function ${\bf g}$ that does not depend on the observed state (e.g. noise)

If ${\bf g} = 0$, it would be possible to perfectly predict the future state, given ${\bf f}$. With this in mind, I'd like to propose the following answers to Q1 and Q2:

A1. As good as it gets, provided our approximation of the action of ${\bf f}$ is as accurate as possible.
A2. Nontrivial if ${\bf f} \neq 0$ and the future state is not a linear function of previous states.

I admit A1 is not a satisfactory answer as it stands; it needs evidence. The type of evidence we'll discuss here is numerical evidence obtained by employing and evaluating several different types of models. If all models were to perform similarly regardless of their complexity or type, it would point to the unpredictability of the data (${\bf f}\approx 0$). Hopefully, however, one obtains better results with more complex models than with simpler ones. At some point, a sweet spot is found where increasing model complexity (or capacity) no longer translates into improved performance. If we make the reasonable assumption that a less complex model is more easily interpreted, we can conjecture that the Occam's-Razor model (OR model) for time-series prediction is precisely this sweet-spot model. Nothing fancier than that. A sketch of this selection logic follows.
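To make the idea concrete, here is a minimal sketch of that selection logic in Python. The RMSE values and the 5% tolerance are illustrative placeholders of my own, not results from this post:

```python
# Validation RMSEs per model, ordered simplest -> most complex
# (placeholder numbers for illustration only)
rmse = {"naive": 0.52, "mean": 0.45, "trend": 0.44,
        "ar": 0.31, "lgbm": 0.30, "fcnn": 0.30}

best = min(rmse.values())
tol = 0.05  # accept anything within 5% of the best (arbitrary threshold)

# Occam's-Razor pick: the simplest model that gets "close enough" to the best
or_model = next(m for m, e in rmse.items() if e <= best * (1 + tol))
print(or_model)  # -> "ar" in this made-up example
```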

The answer to Q3 is model-dependent. Model interpretation is challenging for NNs when both the state space and the model are large. Tree-ensemble models such as GBMs are commonly interpreted by looking at feature importances, as sketched below.
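For instance, with LightGBM's scikit-learn API (the data here is random placeholder data; in practice X would hold the lagged samples described later):

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X, y = rng.random((500, 10)), rng.random(500)  # placeholder data

model = LGBMRegressor(n_estimators=100).fit(X, y)

# How often each input (lag) was used for a split, summed over all trees
for lag, imp in enumerate(model.feature_importances_, start=1):
    print(f"lag {lag}: {imp}")
```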

Prediction Models

In the quest for an OR-like model, I find the following types of models, ordered by complexity, useful:

Level 1. Naive or Simple models
Level 2. ARIMA (Autoregressive Integrated Moving Average)
Level 3. NNs and tree-ensemble GBMs

Level 3 represents the models with the highest capacity: given enough data, they can approximate almost any function. LGBM and an NN are used here. A standard Fully Connected NN (FCNN) is employed as a representative of the NN family, though many other NN architectures exist that might perform better than this standard type. The utilized NN has a cone-like architecture, as illustrated below: from the first hidden layer onwards, each layer gets (linearly) narrower until the scalar output of the final layer. The number of layers, the number of neurons in the first hidden layer, and the type of activation function (Linear, ReLU, or Tanh) are hyperparameters tuned for each dataset.

Figure: Fully Connected Neural Network (FCNN) with a cone-like architecture for scalar time-series prediction.
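A minimal sketch of such a cone-like network, written here with PyTorch (the function name and the exact linear-taper scheme are my own interpretation of the architecture described above):

```python
import torch.nn as nn

def build_cone_fcnn(n_lags: int, n_layers: int, first_width: int,
                    activation=nn.ReLU) -> nn.Sequential:
    """FCNN whose hidden widths shrink linearly down to a scalar output."""
    # Linearly spaced widths, e.g. 64 -> 43 -> 21 for n_layers=3, first_width=64
    widths = [max(1, round(first_width * (n_layers - i) / n_layers))
              for i in range(n_layers)]
    layers, in_features = [], n_lags
    for w in widths:
        layers += [nn.Linear(in_features, w), activation()]
        in_features = w
    layers.append(nn.Linear(in_features, 1))  # scalar output
    return nn.Sequential(*layers)

# For the "Linear" activation option, nn.Identity can be passed instead
net = build_cone_fcnn(n_lags=10, n_layers=3, first_width=64, activation=nn.Tanh)
```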

ARMA is the bread-and-butter model for time-series prediction, able to capture a (stationary) series that depends linearly on its own previous values and on noise (error terms); ARIMA extends it with differencing to handle nonstationary series. It therefore acts as a middle-ground, conventional model. By “Naive or Simple models” I refer here to the following three models:

Naive (or persistence model):

$${\bf \hat{x}}_{k+T} = {\bf x}_{k}$$

Mean (Simple 1):

$${\bf \hat{x}}_{k+T} = \sum_{i=k-d}^{k} \frac{{\bf x}_i}{d+1}$$

Mean Trend (Simple 2):

$${\bf \hat{x}}_{k+T} = {\bf x}_k + T\sum_{i=k-d}^{k} \frac{{\bf x}_i - {\bf x}_{i-1}}{d+1}$$
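Each of these baselines is a couple of lines of NumPy; a minimal sketch (the function names are mine, and x is assumed to hold the series up to and including time k):

```python
import numpy as np

def naive_forecast(x: np.ndarray, T: int) -> float:
    """Persistence: the prediction T steps ahead is just the last observed value."""
    return x[-1]

def mean_forecast(x: np.ndarray, d: int) -> float:
    """Mean of the last d+1 samples."""
    return x[-(d + 1):].mean()

def mean_trend_forecast(x: np.ndarray, T: int, d: int) -> float:
    """Last value plus T times the mean one-step difference over the window."""
    return x[-1] + T * np.diff(x[-(d + 2):]).mean()
```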

Data

Let's now look at three types of data to exemplify the above discourse. All data are scalar time-series, and lagged samples are used as input features (as illustrated in the depiction of the NN architecture above); see the code sketch after the list below.

Data 1. Gaussian noise: $x_k = \eta_k, \qquad \eta_k \sim \mathcal{N}(\mu, \sigma^2)$
Data 2. Auto-regressive data (AR(2))
Data 3. Nonlinear data (Charney-DeVore equations)
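As a concrete illustration, here is a minimal sketch, assuming NumPy, of how Data 1 and Data 2 could be generated and how lagged samples are turned into input features. The AR(2) coefficients, noise levels, and helper name are my own illustrative choices, and Data 3 would additionally require numerically integrating the Charney-DeVore equations (omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Data 1: Gaussian noise around a mean (mu and sigma are placeholders)
gauss = rng.normal(loc=1.0, scale=0.5, size=n)

# Data 2: AR(2) process x_k = a1*x_{k-1} + a2*x_{k-2} + noise; these
# coefficients lie inside the stationarity region and give sustained,
# noise-driven oscillations (values are illustrative)
a1, a2 = 1.6, -0.9
ar = np.zeros(n)
for k in range(2, n):
    ar[k] = a1 * ar[k - 1] + a2 * ar[k - 2] + rng.normal(scale=0.1)

def make_lagged_features(x: np.ndarray, n_lags: int, T: int):
    """X[i] holds n_lags consecutive samples; y[i] is the value T steps
    after the last sample of X[i]."""
    X = np.lib.stride_tricks.sliding_window_view(x, n_lags)[:-T]
    y = x[n_lags - 1 + T:]
    return X, y

X, y = make_lagged_features(ar, n_lags=10, T=5)
```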

For each dataset, six prediction models of varying complexity are trained, optimized (where applicable, the Python library “hyperopt” is used for hyperparameter optimization), and evaluated using RMSE. The test portion of the data is split into several segments, and an RMSE is calculated for each segment. The model with the lowest average RMSE is then compared to all other models using the Wilcoxon signed-rank test (a paired t-test would also be possible if the errors were normally distributed), as sketched below.
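The significance test itself is a one-liner with SciPy; a minimal sketch, where the per-segment RMSE arrays are illustrative placeholders:

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-segment test RMSEs of the best model and one competitor,
# paired by segment (placeholder values)
rmse_best = np.array([0.11, 0.09, 0.12, 0.10, 0.13, 0.11, 0.10, 0.12])
rmse_other = np.array([0.14, 0.10, 0.15, 0.12, 0.16, 0.13, 0.12, 0.15])

stat, p_value = wilcoxon(rmse_best, rmse_other)
print("Beaten" if p_value < 0.05 else "Draw")
```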

Figure: Scalar time-series split into three segments.

Each dataset is split into three segments that make up the training, validation, and test sets, as illustrated above. The validation set is used to evaluate the models' performance for each hyperparameter configuration.
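Since the data is sequential, the split must be chronological (no shuffling); a minimal sketch, where the 60/20/20 fractions are my own assumption:

```python
import numpy as np

x = np.arange(1000, dtype=float)  # stand-in for a scalar time-series

# Keep time order so that validation and test lie strictly in the future
i1, i2 = int(0.6 * len(x)), int(0.8 * len(x))
train, val, test = x[:i1], x[i1:i2], x[i2:]
```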

Experiment Results

Let's check the results. Example predictions from the two best models, plus the naive model, are shown below for each dataset. The naive predictions are plotted because they illustrate the prediction time-scale: they're just lagged copies of the measurements. The Gaussian noise consists simply of unpredictable fluctuations around a mean value (which, hopefully, the models are able to learn). The AR-data exhibits non-decaying oscillations observed under measurement noise and does not pose much of a problem for several of the models. The LGBM model clearly outshines all the others when it comes to predicting the nonlinear time-series.

As for the statistical test, I got the results shown in the table below. The model with the lowest RMSE for each dataset is marked “Best”. The label “Beaten” is given to models with a significantly higher RMSE than the best model, while “Draw” is declared when the test result wasn't significant.

| Model | gauss  | ar     | nonlinear |
|-------|--------|--------|-----------|
| naive | Beaten | Beaten | Beaten    |
| mean  | Beaten | Beaten | Beaten    |
| trend | Beaten | Beaten | Beaten    |
| ar    | Beaten | Beaten | Beaten    |
| lgbm  | Draw   | Draw   | Best      |
| FCNN  | Best   | Best   | Beaten    |

Interestingly, FCNN and LGBM both performed (slightly) better than the mean model at predicting the mean of the Gaussian noise. However, as you can see in the figures depicting the errors (quartiles) below, the difference is minor. Similarly, for the AR-data, the AR model, LGBM, and FCNN were all close. For the nonlinear data, LGBM proved substantially better than the others, achieving an almost perfect $R^2$-score of 0.99. Thus, one could argue that the mean model, the AR model, and the LGBM model, respectively, should suffice for the three datasets.


That's all. Check out the code on GitHub if you're interested. Leave a comment or contact me if you have any questions.