darts: [BUG] LightGBM past covariates look-ahead bias
Describe the bug
I am concerned that there is a specific corner case of look-ahead bias. It seems to occur with the following configuration:
- model == LightGBMModel
- output_chunk_length < horizon
- past_covariates != None
To Reproduce
After loading some arbitrary data, we generate two versions of the past_covariates series:
- pc, which has not been manipulated and consists of the pc_train and pc_test series
- pc_zeros, in which pc_test has been floored to zero
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from darts.datasets import EnergyDataset
from darts.models import LightGBMModel
from darts import TimeSeries

# Target and past covariate from the energy dataset
data = EnergyDataset().load().pd_dataframe().head(1000)
y = data['total load actual'].fillna(0)
pc = data['generation biomass'].fillna(0)

length = 400
y_train, y_test = y.iloc[:length], y.iloc[length:]
pc_train, pc_test = pc.iloc[:length], pc.iloc[length:]

# pc_zeros: identical to pc up to the cutoff, zeroed afterwards
zeros = pd.Series(np.zeros(600), index=pc_test.index)
pc_zeros = pd.concat([pc_train, zeros])

pc.iloc[390:410].plot(label='pc');
pc_zeros.iloc[390:410].plot(label='pc_zeros');
plt.legend();
# Convert to TimeSeries
pc = TimeSeries.from_series(pc)
y_train = TimeSeries.from_series(y_train)
pc_train = TimeSeries.from_series(pc_train)
pc_test = TimeSeries.from_series(pc_test)
pc_zeros = TimeSeries.from_series(pc_zeros)
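A quick sanity check (assuming the biomass series is not itself all zeros after the cutoff) that the two versions agree up to the training cutoff and differ only afterwards:

# pc and pc_zeros share the first `length` points and diverge afterwards
assert (pc[:length].values() == pc_zeros[:length].values()).all()
assert (pc[length:].values() != pc_zeros[length:].values()).any()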
Next we train a model where output_chunk_length == horizon. Then we predict 24 steps ahead using both pc and pc_zeros as input. As expected, this produces identical results, since manipulating past_covariates inside the prediction window should not affect the forecast.
model = LightGBMModel(
    lags=10,
    lags_past_covariates=10,
    output_chunk_length=24,
    random_state=0)
model.fit(series=y_train, past_covariates=pc_train);
y_pred = model.predict(24, past_covariates=pc)
y_pred_noise = model.predict(24, past_covariates=pc_zeros)
y_pred.plot(label='pc')
y_pred_noise.plot(label='pc_zeros')
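To verify that the two forecasts match numerically rather than only visually, a quick check using the variables defined above:

# Identical forecasts: a direct model never reads covariate values
# beyond the forecast start, so zeroing them has no effect
print(np.allclose(y_pred.values(), y_pred_noise.values()))  # True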
However, when output_chunk_length < horizon, the results are markedly different:
model = LightGBMModel(
    lags=10,
    lags_past_covariates=10,
    output_chunk_length=1,
    random_state=0)
model.fit(series=y_train, past_covariates=pc_train);
y_pred = model.predict(24, past_covariates=pc)
y_pred_noise = model.predict(24, past_covariates=pc_zeros)
y_pred.plot(label='pc')
y_pred_noise.plot(label='pc_zeros')
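The same check now fails, and the size of the gap can be quantified:

# Diverging forecasts: with output_chunk_length=1 and horizon=24, the
# auto-regressive loop reads covariate values beyond the forecast start
print(np.allclose(y_pred.values(), y_pred_noise.values()))  # False
print(np.abs(y_pred.values() - y_pred_noise.values()).max())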
Conclusion
It seems that there is a look-ahead bias with a specific LightGBM configuration. This is somewhat dangerous, given that output_chunk_length is an optional parameter of LightGBMModel defaulting to 1, and that there is no warning in the current class documentation:
output_chunk_length (int) – Number of time steps predicted at once by the internal regression model. Does not have to equal the forecast horizon n used in predict(). However, setting output_chunk_length equal to the forecast horizon may be useful if the covariates don’t extend far enough into the future.
Proposed solution
As a quick fix, enforce output_chunk_length == horizon.
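A rough sketch of what such a guard could look like; the function name and signature here are made up for illustration, and this is not the actual darts source:

def check_predict_args(n, output_chunk_length, past_covariates):
    # Hypothetical guard: refuse to auto-regress past output_chunk_length
    # when past covariates are supplied, since doing so silently consumes
    # their future values.
    if past_covariates is not None and n > output_chunk_length:
        raise ValueError(
            f"n ({n}) > output_chunk_length ({output_chunk_length}) with "
            "past_covariates would require future covariate values.")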
System
- Python version: 3.8.13
- Darts version: 0.24.0
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 17 (3 by maintainers)
Auto-regression is only used when horizon > output_chunk_length. Up until output_chunk_length, the model predicts directly/in one go and only uses the past values of past covariates.

If you don't have future values of your past covariates, it is not possible to predict n > output_chunk_length. The model raises an error in this case that your past covariates are too short. So in the real-life scenario it is not possible to have a look-ahead bias with these features (assuming they are not known in advance; if they are known in advance, then we can predict auto-regressively).

There are some models which do not accept future covariates (such as NBEATSModel, BlockRNNModel, …). They do however accept past covariates. We can still use any potential future covariates as past_covariates with these models, and then perform auto-regression when n > output_chunk_length.
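To make the mechanism concrete, here is a rough sketch of how the auto-regressive loop slides the covariate lag window past the forecast start. This is an assumed simplification, not the actual darts source; predict_one stands in for the fitted one-step estimator:

import numpy as np

def autoregressive_forecast(predict_one, y_hist, pc_full, n, lags=10):
    # output_chunk_length == 1: each step feeds the previous prediction back
    series = list(y_hist)
    t0 = len(y_hist)  # forecast start
    preds = []
    for step in range(n):
        x_target = series[-lags:]
        # The covariate window is anchored at the current step, so for
        # step > 0 it reaches past t0 into the "future" of pc_full
        x_pc = list(pc_full[t0 + step - lags : t0 + step])
        preds.append(predict_one(np.array(x_target + x_pc)))
        series.append(preds[-1])
    return np.array(preds)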
Hi @dennisbader, and thank you for looking into this with this level of detail.
Unfortunately, I still believe the behaviour is bad regardless of whether it is intended or not. If we first define that:
and then explicitly state that:
we then directly violate both of these statements when we use future values of past_covariates in the prediction window. I don't see any reason why we should ever do so. past_covariates and future_covariates are useful and clearly defined abstractions.

Following these definitions, when you then say that:

aren't those variables then by definition future_covariates? You give datetime attributes as an example, but shouldn't these variables belong to the future_covariates category anyway?

I still believe this is an issue of the source code, not the documentation.
I agree that this is (1) confusing and (2) can lead to different behaviour in production than anticipated.
If I deploy such a model with output_chunk_length=1 and my horizon is, say, 24, the model expects me to supply future values of the past_covariates. This is impossible in the same way that I cannot supply future values of the target - that is what I'm predicting.

In your example you "solve" this by appending forecasts to past_covariates… which makes them future covariates anyway: https://unit8co.github.io/darts/userguide/torch_forecasting_models.html#id4:~:text=temperature%20%3D%20temperature.concatenate(temperature_forecast%2C%20axis%3D0)
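The pattern referenced by that link amounts to extending the measured covariates with their own forecast before predicting; roughly as follows (the sine series here are stand-in data, not taken from the linked docs):

from darts.utils.timeseries_generation import sine_timeseries

temperature = sine_timeseries(length=400)  # stand-in for measured covariates
temperature_forecast = sine_timeseries(
    length=24, start=temperature.end_time() + temperature.freq)
# Append the forecast so auto-regression has covariate values to read
temperature = temperature.concatenate(temperature_forecast, axis=0)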
For now I will use multi_models=True with output_chunk_length=24 and horizon=24; this way all of the horizons are predicted from actual past values, but I am losing the auto-regressive option.

I'm also still very interested in hearing a maintainer's take on this. As far as I know, hrzn no longer works on the project. I know from Gitter that @madtoinou is aware of this issue; maybe he could look into it.
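For reference, a minimal sketch of that workaround using the variables from the reproduction script above (multi_models is a real LightGBMModel parameter; it defaults to True in recent darts versions):

model = LightGBMModel(
    lags=10,
    lags_past_covariates=10,
    output_chunk_length=24,  # equal to the horizon: no auto-regression
    multi_models=True,       # one internal model per horizon step
    random_state=0)
model.fit(series=y_train, past_covariates=pc_train)
# No covariate values beyond the forecast start are required or read
y_pred_safe = model.predict(24, past_covariates=pc_train)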