sktime: [BUG] ColumnTransformer Unnecessary ValueError in ForecastingPipeline
Describe the bug
When using ColumnTransformer in ForecastingPipeline,
- model.predict gives
ValueError: If passed as a pd.DataFrame, X must be a nested pd.DataFrame, with pd.Series or np.arrays inside cells.
- model.update gives
TypeError: y must be in an sktime compatible format, of scitype Series, Panel or Hierarchical, for instance a pandas.DataFrame with sktime compatible time indices, or with MultiIndex and last(-1) level an sktime compatible time index. Allowed compatible mtype format specifications are: ['pd.Series', 'pd.DataFrame', 'np.ndarray', 'nested_univ', 'numpy3D', 'pd-multiindex', 'df-list', 'pd_multiindex_hier'] . See the data format tutorial examples/AA_datatypes_and_datasets.ipynb. If you think the data is already in an sktime supported input format, run sktime.datatypes.check_raise(data, mtype) to diagnose the error, where mtype is the string of the type specification you want. Error message for checked mtypes, in format [mtype: message], as follows
To Reproduce
from sktime.transformations.series.boxcox import LogTransformer
from sktime.transformations.series.exponent import ExponentTransformer
from sktime.datasets import load_longley
from sktime.transformations.panel.compose import ColumnTransformer
from sktime.forecasting.sarimax import SARIMAX
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.transformations.compose import OptionalPassthrough
y, X = load_longley()
y_train, y_test, X_train, X_test = temporal_train_test_split(y, X)
selected_columns_1 = ["GNP", "UNEMP"]
selected_columns_2 = ["GNPDEFL"]
pipe = ColumnTransformer(
    [
        ("log", LogTransformer(), selected_columns_1),
        ("exp", ExponentTransformer(power=2), selected_columns_2),
    ],
    remainder="passthrough",  # "drop"
)
# Workaround for ValueError just for `predict` method
# pipe = OptionalPassthrough(pipe)
model = pipe ** SARIMAX()
model.fit(y_train, X_train)
# Gives ValueError if OptionalPassthrough is not used
model.predict(fh=[1, 2, 3, 4], X=X_test)
# Gives TypeError either way
model.update(y=y_test.iloc[:2], X=X_test.iloc[:2, :])
Expected behavior
No ValueError should be thrown.
Additional context
Versions
python: 3.8.15
executable: /usr/bin/python
machine: Linux-5.19.0-26-generic-x86_64-with-glibc2.17
Python dependencies:
pip: 22.3.1
setuptools: 65.5.0
sklearn: 1.2.0
sktime: 0.15.0
statsmodels: 0.13.5
numpy: 1.22.4
scipy: 1.9.3
pandas: 1.5.2
matplotlib: 3.6.2
joblib: 1.2.0
numba: 0.56.4
pmdarima: None
tsfresh: 0.20.0
About this issue
- State: open
- Created a year ago
- Comments: 20
Thank you for the update. Actually, I have already been following those issues. I think an sktime-native ColumnTransformer makes more sense and is easier to develop without having to think about internal sklearn changes. Good thinking on that 👍 Will test it as soon as I can.
@ilkersigirci, short update:
I’ve been trying to refactor the ColumnTransformer, but it is too closely linked to sklearn. So, for now, I have written an sktime version, and I plan to move the two interfaces together in the mid term:
https://github.com/sktime/sktime/pull/4231
https://github.com/sktime/sktime/pull/4232
Once that is done, we can look at cleaner syntax and composition.
Feedback (and testing) appreciated!
In my design thinking, it’s less about the DRY principle than about SRP (single responsibility): sticking logic into the base class gives it too much added responsibility. Imo it should only be responsible for interface (outside) and boilerplate (inside).
I’ll think about it a bit and try to come up with a good solution (easy to implement and intuitive to use). Anyone can of course contribute in the meantime.
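To make the SRP point concrete, here is a minimal sketch of that split, written in plain Python so it runs without sktime or sklearn. Everything in it is hypothetical and for illustration only: `BaseTransformer`, `FunctionTransformer`, `ColumnwiseTransformer` and the dict-of-lists `Table` are stand-ins, not the sktime classes or the code in the linked PRs. The base class only owns the public interface and input checks, and the column-wise logic lives in a separate compositor.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Stand-in for a DataFrame: column name -> list of values (assumption for this sketch).
Table = Dict[str, List[float]]


class BaseTransformer:
    """Hypothetical base class: public interface and boilerplate checks only."""

    def fit(self, X: Table) -> "BaseTransformer":
        self._check(X)
        self._fit(X)
        self._is_fitted = True
        return self

    def transform(self, X: Table) -> Table:
        if not getattr(self, "_is_fitted", False):
            raise RuntimeError("call fit before transform")
        self._check(X)
        return self._transform(X)

    @staticmethod
    def _check(X: Table) -> None:
        # Boilerplate validation shared by all transformers.
        if len({len(col) for col in X.values()}) > 1:
            raise ValueError("all columns must have the same length")

    def _fit(self, X: Table) -> None:
        # Default: nothing to learn.
        pass

    def _transform(self, X: Table) -> Table:
        raise NotImplementedError


class FunctionTransformer(BaseTransformer):
    """Applies one function element-wise to every column it is given."""

    def __init__(self, func: Callable[[float], float]):
        self.func = func

    def _transform(self, X: Table) -> Table:
        return {name: [self.func(v) for v in col] for name, col in X.items()}


Step = Tuple[str, BaseTransformer, Sequence[str]]


class ColumnwiseTransformer(BaseTransformer):
    """Composition logic lives here, not in the base class."""

    def __init__(self, steps: Sequence[Step], remainder: str = "passthrough"):
        self.steps = list(steps)
        self.remainder = remainder  # "passthrough" or "drop"

    def _fit(self, X: Table) -> None:
        for _, est, cols in self.steps:
            est.fit({c: X[c] for c in cols})

    def _transform(self, X: Table) -> Table:
        out: Table = {}
        used = set()
        for _, est, cols in self.steps:
            out.update(est.transform({c: X[c] for c in cols}))
            used.update(cols)
        if self.remainder == "passthrough":
            # Untouched columns are appended after the transformed ones.
            for name, col in X.items():
                if name not in used:
                    out[name] = list(col)
        return out


# Made-up values; the column names only echo the Longley example above.
X = {"GNP": [1.0, math.e], "GNPDEFL": [2.0, 3.0], "POP": [5.0, 6.0]}
pipe = ColumnwiseTransformer(
    [
        ("log", FunctionTransformer(math.log), ["GNP"]),
        ("square", FunctionTransformer(lambda v: v ** 2), ["GNPDEFL"]),
    ]
)
Xt = pipe.fit(X).transform(X)
print(Xt)
```

In this sketch, adding a new way of composing transformers means writing a new subclass with its own `_fit` and `_transform`, while `BaseTransformer` stays unchanged.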
PS all, we recently gave a tutorial on transformers, feature union, pipelines, and the like at PyData Global 2022!
Here: https://github.com/sktime/sktime-tutorial-pydata-global-2022
All the notebooks are available on Binder; you will probably be interested in part 2.
Not sure if the video is already online, but if not it should be released very soon!