sktime: [BUG] unexpected "incompatible `window_length`" exception when using `ForecastingGridSearchCV`

@yarnabrina

Describe the bug A grid search CV run raises the following error: ValueError: The window_length and fh are incompatible with the length of y

To Reproduce

from sktime.datasets import load_airline
from sktime.forecasting.compose import DirectTimeSeriesRegressionForecaster, DirectTabularRegressionForecaster
from sktime.forecasting.model_selection import ForecastingGridSearchCV, temporal_train_test_split, SlidingWindowSplitter
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error
from sktime.regression.interval_based import TimeSeriesForestRegressor
from sktime.forecasting.base import ForecastingHorizon

# Load the data and split into train and test sets
y = load_airline()
y_train, y_test = temporal_train_test_split(y=y, test_size=0.2)

# Create a TimeSeriesForestRegressor
estimator = TimeSeriesForestRegressor(min_interval=2, n_estimators=100, n_jobs=-1, random_state=0)

window_length = 6

# Create a DirectTimeSeriesRegressionForecaster using the TimeSeriesForestRegressor
# forecaster = DirectTimeSeriesRegressionForecaster(estimator=estimator, window_length=12)
forecaster = DirectTabularRegressionForecaster(estimator=estimator, window_length=12)

# Create a ForecastingGridSearchCV object to perform grid search
param_grid = {"window_length": [6, 8, 10, 12]}
fh = ForecastingHorizon([1, 2, 3])
cv = SlidingWindowSplitter(fh=fh, window_length=window_length, initial_window=24, step_length=12)

n_splits = cv.get_n_splits(y)
print(f"Number of Folds = {n_splits}")

gscv = ForecastingGridSearchCV(forecaster=forecaster, cv=cv, param_grid=param_grid, error_score='raise')

# Fit the forecaster to the training data
gscv.fit(y=y_train, fh=fh)

# Generate forecasts for the test data
y_pred = gscv.predict(fh=fh)

# Evaluate the forecasts using the sMAPE loss function
print(f"MAPE loss: {mean_absolute_percentage_error(y_true=y_test, y_pred=y_pred):.4f}")

Expected behavior I expect a GridSearchCV to be performed on the time series.

Versions System: python: 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] executable: /usr/bin/python3 machine: Linux-5.15.107+-x86_64-with-glibc2.31

Python dependencies: pip: 23.1.2 sktime: 0.19.2 sklearn: 1.2.2 skbase: 0.4.6 numpy: 1.22.4 scipy: 1.10.1 pandas: 1.5.3 matplotlib: 3.7.1 joblib: 1.2.0 statsmodels: 0.13.5 numba: 0.56.4 pmdarima: None tsfresh: None tensorflow: 2.12.0 tensorflow_probability: 0.20.1

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 18 (4 by maintainers)

Most upvoted comments

How do I tune this so that I do not run into this error in the future? Can you rewrite my code snippet so that I see where I missed it?

You can avoid it by following what @fkiraly has brilliantly summarised in the main points of the discussion. For clarity, a code snippet is below (… means same as the original snippet):

... 
# Create a TimeSeriesForestRegressor
estimator = TimeSeriesForestRegressor(min_interval=2, n_estimators=100, n_jobs=-1, random_state=0)

window_length = 15 # 12 (longest window length) + 3 (forecast horizon)

# Create a DirectTimeSeriesRegressionForecaster using the TimeSeriesForestRegressor
# forecaster = DirectTimeSeriesRegressionForecaster(estimator=estimator, window_length=12)
forecaster = DirectTabularRegressionForecaster(estimator=estimator, window_length=12)
...

plot_windows I think does something similar, although there are only two windows in there, not four. Maybe a variant of that could be used for illustration? I think the example from the top is a really nice illustration of what the user would have to deal with without sktime, and how sktime at least ensures interface consistency.

This is for the user tutorial, right?

I think this is a nice didactic example, maybe a picture of what is going on with the multiple windows in here might be illustrative.

There are time-contiguous splits, and they form a nested hierarchy of three:

  1. the initial train/test split
  2. train is split again, window-sliding, giving train/test for the tuning
  3. then, the train from 2 is using a sliding window again to tabulate for the reduction forecaster

At each level, any window must be smaller than the window at the level above it. In your setup, the windows in 3 are larger than the windows in 2, so it complains. As pointed out above, increasing the window_length in the SlidingWindowSplitter (but not in the others) should ensure the sizes are the right way round (decreasing as you go down the hierarchy).
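To make the size constraint concrete, here is a minimal sketch. The helper `cv_window_is_compatible` is hypothetical (not a sktime API); it just expresses the condition that each CV training window must be long enough to hold one inner window of the forecaster's `window_length` plus `max(fh)` target points for the reduction:

```python
# Hypothetical helper (not part of sktime) illustrating the nesting
# constraint: the outer CV window must fit one inner reduction window
# of `forecaster_window_length` points plus max(fh) target points.
def cv_window_is_compatible(cv_window_length, forecaster_window_length, fh):
    return cv_window_length >= forecaster_window_length + max(fh)

# Original settings: a 6-point CV window cannot hold a 12-point
# forecaster window plus the 3-step horizon, hence the ValueError.
print(cv_window_is_compatible(6, 12, [1, 2, 3]))   # False
# Suggested fix: 15 = 12 (longest tuned window length) + 3 (max horizon).
print(cv_window_is_compatible(15, 12, [1, 2, 3]))  # True
```

Note the initial window of 24 is fine either way; it is the subsequent 6-point sliding windows that violate the condition.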

Hi @onyekaugochukwu, I’ll give a poor attempt to explain as I understand now myself, and then @fkiraly / @hazrulakmal / @llllkihn can add/modify/delete where I got the stuff wrong. I’m not confident as I haven’t used these myself before, and hopefully trying to explain and then being corrected will improve my understanding.

As you have checked above, you're right that the error condition is not true for your entire dataset. That is true, but the entire dataset is not what gets passed to the forecaster. Since you're doing cross-validation, each fold only passes a subset of the data to the forecaster.

>>> print([(len(y_train), len(y_test)) for y_train, y_test in self.split(y.index)])  #added inside BaseSplitter.split_series

[(24, 3), (6, 3), (6, 3), (6, 3), (6, 3), (6, 3), (6, 3), (6, 3)]

Your entire data has n_timepoints=144, but these splits do not. They have far fewer data points, and that is by design of cross-validation. If you run your check on each of these splits, it will yield an error for all of them if you use fh_max=44.
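The fold sizes printed above can be reproduced with a plain-Python sketch of the sliding-window logic. This is an assumption-laden simplification, not sktime's actual implementation; it assumes a 115-point training series (roughly 80% of the 144-point airline series) and the splitter settings from the bug report:

```python
# Simplified sketch (assumption: not sktime's real SlidingWindowSplitter)
# of how a series of n points is carved into (train, test) index windows.
def sliding_window_splits(n, window_length, fh, step_length, initial_window):
    fh_max = max(fh)
    splits = []
    cutoff = initial_window  # the first fold trains on the initial window
    first = True
    while cutoff + fh_max <= n:
        train_len = initial_window if first else window_length
        train = list(range(cutoff - train_len, cutoff))
        test = [cutoff - 1 + h for h in fh]  # fh steps past the cutoff
        splits.append((train, test))
        cutoff += step_length
        first = False
    return splits

splits = sliding_window_splits(
    n=115, window_length=6, fh=[1, 2, 3], step_length=12, initial_window=24
)
print([(len(tr), len(te)) for tr, te in splits])
# [(24, 3), (6, 3), (6, 3), (6, 3), (6, 3), (6, 3), (6, 3), (6, 3)]
```

Only the first fold (the 24-point initial window) can accommodate the forecaster's 12-point inner window plus the 3-step horizon; every later 6-point fold cannot, which is where the error comes from.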

Let me know if it made any sense, and any corrections are more than welcome.

FYI @hazrulakmal, as you were recently working on this.