darts: [BUG] LightGBM past covariates look-ahead bias

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from darts.datasets import EnergyDataset
from darts.models import LightGBMModel
from darts import TimeSeries

Describe the bug

I am concerned that there is a specific corner case that introduces look-ahead bias. It appears with the following configuration:

  • model == LightGBMModel
  • output_chunk_length < horizon
  • past_covariates != None

To Reproduce

After loading some arbitrary data, we generate two versions of the past_covariates series:

  1. pc, which has not been manipulated, consisting of the pc_train and pc_test series
  2. pc_zeros, in which pc_test has been floored to zero

data = EnergyDataset().load().pd_dataframe().head(1000)

y = data['total load actual'].fillna(0)
pc = data['generation biomass'].fillna(0)

length = 400
y_train, y_test = y.iloc[:length], y.iloc[length:]
pc_train, pc_test = pc.iloc[:length], pc.iloc[length:]

zeros = pd.Series(np.zeros(len(pc_test)), index=pc_test.index)
pc_zeros = pd.concat([pc_train, zeros])

pc.iloc[390:410].plot(label='pc');
pc_zeros.iloc[390:410].plot(label='pc_zeros');
plt.legend();

(figure: plot of pc vs. pc_zeros around the train/test split)

# Convert to TimeSeries
pc = TimeSeries.from_series(pc)
y_train = TimeSeries.from_series(y_train)
pc_train = TimeSeries.from_series(pc_train)
pc_test = TimeSeries.from_series(pc_test)
pc_zeros = TimeSeries.from_series(pc_zeros)

Next we train a model where output_chunk_length == horizon. We then predict 24 steps ahead using both pc and pc_zeros as input. As expected, this produces identical results, since manipulating past_covariates inside the prediction window should not affect the forecast.

model = LightGBMModel(
    lags=10, 
    lags_past_covariates=10, 
    output_chunk_length=24,
    random_state=0)

model.fit(series=y_train, past_covariates=pc_train);

y_pred = model.predict(24, past_covariates=pc)
y_pred_noise = model.predict(24, past_covariates=pc_zeros)

y_pred.plot(label='pc')
y_pred_noise.plot(label='pc_zeros')

(figure: identical forecasts for pc and pc_zeros)

However, when output_chunk_length < horizon, the results are markedly different:

model = LightGBMModel(
    lags=10, 
    lags_past_covariates=10, 
    output_chunk_length=1,
    random_state=0)

model.fit(series=y_train, past_covariates=pc_train);

y_pred = model.predict(24, past_covariates=pc)
y_pred_noise = model.predict(24, past_covariates=pc_zeros)

y_pred.plot(label='pc')
y_pred_noise.plot(label='pc_zeros')

(figure: markedly different forecasts for pc and pc_zeros)

Conclusion

It seems that there is a look-ahead bias with this specific LightGBMModel configuration. This is somewhat dangerous given that for LightGBMModel the output_chunk_length is an optional parameter defaulting to 1, and there is no warning in the current class documentation:

output_chunk_length (int) – Number of time steps predicted at once by the internal regression model. Does not have to equal the forecast horizon n used in predict(). However, setting output_chunk_length equal to the forecast horizon may be useful if the covariates don’t extend far enough into the future.
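The relationship between horizon and output_chunk_length can be made explicit with a bit of arithmetic. The helper below is hypothetical (it is not a darts API) and just counts how many past-covariate values beyond the forecast start an auto-regressive prediction would consume:

```python
def future_past_cov_steps(n, output_chunk_length):
    """How many past-covariate values at or beyond the forecast start are
    consumed when predicting n steps auto-regressively.
    Hypothetical helper, for illustration only."""
    return max(0, n - output_chunk_length)

print(future_past_cov_steps(24, 24))  # 0  -> no future values needed
print(future_past_cov_steps(24, 1))   # 23 -> 23 "future" values of past_covariates
```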

Proposed solution

As a quick fix, enforce output_chunk_length == horizon.
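A minimal sketch of such a guard (hypothetical code, not the darts implementation): raise instead of silently consuming future values of past_covariates.

```python
def check_no_lookahead(n, output_chunk_length):
    """Reject forecasts that would auto-regress over past covariates.
    Hypothetical guard illustrating the proposed quick fix."""
    if n > output_chunk_length:
        raise ValueError(
            f"horizon {n} > output_chunk_length {output_chunk_length}: "
            "predicting would require future values of past_covariates"
        )

check_no_lookahead(24, 24)      # fine: direct prediction, no look-ahead
try:
    check_no_lookahead(24, 1)   # rejected: would need 23 future covariate values
except ValueError as e:
    print(e)
```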

System

  • Python version: 3.8.13
  • Darts version: 0.24.0

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 17 (3 by maintainers)

Most upvoted comments

Auto-regression is only used when horizon > output_chunk_length. Up until output_chunk_length, the model predicts directly/in one go and only uses the past values of past covariates.

If you don’t have future values of your past covariates, it is not possible to predict n > output_chunk_length. The model raises an error in this case that your past covariates are too short. So in the real-life scenario it is not possible to have a look-ahead bias with these features (assuming they are not known in advance; if they are known in advance, then we can predict auto-regressively).

There are some models which do not accept future covariates (such as NBEATSModel, BlockRNNModel, …). They do however accept past covariates. We can still use any potential future covariates as past_covariates with these models, and then perform auto-regression when n > output_chunk_length.
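To make the mechanics concrete, here is a stripped-down sketch of this auto-regressive loop in plain Python (a toy mean-of-inputs model, not darts internals; all names are illustrative). It records which past-covariate positions are read: with output_chunk_length == n, none fall inside the forecast window, while with output_chunk_length=1 every step after the first reads positions at or beyond the forecast start.

```python
def autoregressive_predict(n, y_hist, pc, lags, output_chunk_length):
    """Toy auto-regressive loop: the 'model' is just the mean of its
    lagged inputs. Returns the forecast and the past-covariate indices
    read at or after the forecast start, i.e. the look-ahead reads."""
    y = list(y_hist)
    t = len(y_hist)                                  # forecast start
    read = set()
    while len(y) - t < n:
        s = len(y)                                   # start of next output chunk
        idx = [s - lag for lag in range(1, lags + 1)]  # lag positions
        read.update(idx)                             # covariate positions consumed
        feats = [y[i] for i in idx] + [pc[i] for i in idx]
        pred = sum(feats) / len(feats)
        chunk = [pred] * output_chunk_length         # direct multi-step chunk
        y.extend(chunk[: n - (len(y) - t)])
    return y[t:], sorted(i for i in read if i >= t)

y_hist = list(range(100))
pc = list(range(200))

# output_chunk_length == horizon: no look-ahead reads
_, future_reads = autoregressive_predict(24, y_hist, pc, lags=2, output_chunk_length=24)
print(future_reads)  # []

# output_chunk_length = 1: 23 positions inside the forecast window are read
_, future_reads = autoregressive_predict(24, y_hist, pc, lags=2, output_chunk_length=1)
print(future_reads[0], future_reads[-1], len(future_reads))  # 100 122 23
```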

Hi @dennisbader and thank you for looking into this with this level of detail.

Unfortunately, I still believe the behaviour is bad regardless if it is intended or not. If we first define that:

past covariates are (by definition) covariates known only into the past (e.g. measurements)

and then explicitly state that:

Models do not use the future values of past_covariates when making forecasts.

we then directly violate both of these statements when we use future values of past_covariates in the prediction window. I don’t see any reason why we should ever do so. past_covariates and future_covariates are useful and clearly defined abstractions:

(diagram: past_covariates vs. future_covariates)

Following these definitions, when you then say that:

If I do know the future values of my past_covariates ahead of time (e.g. some datetime attributes like the value of the month, day, …) then we think it is fair to use them for predicting with

aren’t those variables then by definition future_covariates? You give datetime attributes as an example, but shouldn’t these variables belong to the future_covariates category anyway?

I still believe this is an issue of the source code, not the documentation.

I agree with the fact that this is 1. confusing, 2. can lead to different behaviour in production than anticipated.

If I deploy such a model with output_chunk_length=1 and my horizon is, say, 24, the model expects me to supply future values of the past_covariates. This is impossible in the same way that I cannot supply future values of the target - that is what I’m predicting.

In your example you “solve” this by appending forecasts to past_covariates… which makes them future covariates anyway: https://unit8co.github.io/darts/userguide/torch_forecasting_models.html#id4

For now I will use multi_models=True with output_chunk_length=24 and horizon=24. This way all of the horizon steps are predicted from actual past values - but I am losing the auto-regressive option.


I’m also still very interested in hearing a maintainer’s take on this. As far as I know, hrzn no longer works on the project. I know from Gitter that @madtoinou is aware of this issue; maybe he could look into it.