sktime: [BUG] ColumnTransformer Unnecessary ValueError in ForecastingPipeline

Describe the bug

When using ColumnTransformer in ForecastingPipeline:

  • model.predict gives ValueError: If passed as a pd.DataFrame, X must be a nested pd.DataFrame, with pd.Series or np.arrays inside cells.
  • model.update gives TypeError: y must be in an sktime compatible format, of scitype Series, Panel or Hierarchical, for instance a pandas.DataFrame with sktime compatible time indices, or with MultiIndex and last(-1) level an sktime compatible time index. Allowed compatible mtype format specifications are: ['pd.Series', 'pd.DataFrame', 'np.ndarray', 'nested_univ', 'numpy3D', 'pd-multiindex', 'df-list', 'pd_multiindex_hier'] . See the data format tutorial examples/AA_datatypes_and_datasets.ipynb. If you think the data is already in an sktime supported input format, run sktime.datatypes.check_raise(data, mtype) to diagnose the error, where mtype is the string of the type specification you want. Error message for checked mtypes, in format [mtype: message], as follows

To Reproduce

from sktime.transformations.series.boxcox import LogTransformer
from sktime.transformations.series.exponent import ExponentTransformer
from sktime.datasets import load_longley
from sktime.transformations.panel.compose import ColumnTransformer
from sktime.forecasting.sarimax import SARIMAX
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.transformations.compose import OptionalPassthrough

y, X = load_longley()
y_train, y_test, X_train, X_test = temporal_train_test_split(y, X)

selected_columns_1 = ["GNP", "UNEMP"]
selected_columns_2 = ["GNPDEFL"]

pipe = ColumnTransformer(
    [
        ("log", LogTransformer(), selected_columns_1),
        ("exp", ExponentTransformer(power=2), selected_columns_2),
    ],
    remainder="passthrough",  # "drop"
)

# Workaround for ValueError just for `predict` method
# pipe = OptionalPassthrough(pipe)

model = pipe ** SARIMAX()
model.fit(y_train, X_train)

# Gives ValueError if OptionalPassthrough is not used
model.predict(fh=[1, 2, 3, 4], X=X_test)

# Gives TypeError either way
model.update(y=y_test.iloc[:2], X=X_test.iloc[:2, :])

Expected behavior

No ValueError should be thrown.

Additional context

Versions

python: 3.8.15
executable: /usr/bin/python
machine: Linux-5.19.0-26-generic-x86_64-with-glibc2.17

Python dependencies:
pip: 22.3.1
setuptools: 65.5.0
sklearn: 1.2.0
sktime: 0.15.0
statsmodels: 0.13.5
numpy: 1.22.4
scipy: 1.9.3
pandas: 1.5.2
matplotlib: 3.6.2
joblib: 1.2.0
numba: 0.56.4
pmdarima: None
tsfresh: 0.20.0

Most upvoted comments

Thank you for the update. Actually, I have already been following those issues. I think an sktime-native ColumnTransformer makes more sense and is easier to develop, without having to think about internal sklearn changes. Good thinking on that 👍

Will test it as soon as I can.

@ilkersigirci, short update:

I’ve been trying to refactor the ColumnTransformer, but it is too closely linked to sklearn. So, for now, I have written an sktime version, and I plan to bring the two interfaces together in the mid term:
https://github.com/sktime/sktime/pull/4231
https://github.com/sktime/sktime/pull/4232

Once that is done, we can look at cleaner syntax and composition.

Feedback (and testing) appreciated!

Fair point, it may also contradict the DRY principle. I just wanted to point out that in sktime this functionality already exists for some transformers.

In my design thinking, it’s less about the DRY principle than about SRP: sticking logic into the base class gives it too much added responsibility. IMO it should only be responsible for the interface (outside) and boilerplate (inside).
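To make the SRP point concrete, here is a minimal sketch in plain Python. It is not sktime code: every class name below is hypothetical, the "table" is just a dict of lists standing in for a DataFrame, and the numbers are toy values rather than Longley data. The idea it tries to show is that the base class only handles the public interface and input checks, while routing transformers to columns lives in a separate compositor.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Toy stand-in for a DataFrame: column name -> list of values (assumption).
Table = Dict[str, List[float]]


class BaseTransformer:
    """Hypothetical base class: public interface and boilerplate checks only."""

    def transform(self, X: Table) -> Table:
        # Boilerplate: validate the input, then delegate to the subclass.
        lengths = {len(col) for col in X.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        return self._transform(X)

    def _transform(self, X: Table) -> Table:
        raise NotImplementedError


class ElementwiseTransformer(BaseTransformer):
    """Applies one function to every value; knows nothing about column selection."""

    def __init__(self, func: Callable[[float], float]):
        self.func = func

    def _transform(self, X: Table) -> Table:
        return {name: [self.func(v) for v in col] for name, col in X.items()}


class ColumnwiseCompositor(BaseTransformer):
    """Column routing is the compositor's single responsibility."""

    def __init__(self, steps: Sequence[Tuple[BaseTransformer, Sequence[str]]]):
        self.steps = steps

    def _transform(self, X: Table) -> Table:
        out = dict(X)  # columns not selected by any step pass through unchanged
        for transformer, columns in self.steps:
            subset = {name: X[name] for name in columns}
            out.update(transformer.transform(subset))
        return out


# Toy values for illustration only.
X = {"GNP": [1.0, math.e], "GNPDEFL": [2.0, 3.0], "POP": [10.0, 20.0]}
pipe = ColumnwiseCompositor(
    [
        (ElementwiseTransformer(math.log), ["GNP"]),
        (ElementwiseTransformer(lambda v: v ** 2), ["GNPDEFL"]),
    ]
)
result = pipe.transform(X)
```

With this split, adding a new elementwise transformer does not touch the column logic, and changing how columns are routed does not touch the base class.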

I think this would make more sense. I hope it is intuitive for the end users.

I’ll think about it a bit and try to come up with a good solution (easy to implement and intuitive to use). Anyone can of course contribute in the meantime.

PS all, we recently gave a tutorial on transformers, feature union, pipelines, and the like at PyData Global 2022!

All the notebooks are available on Binder here: https://github.com/sktime/sktime-tutorial-pydata-global-2022
You will probably be interested in part 2.

Not sure if the video is already online, but if not it should be released very soon!