scikit-learn: Pipeline doesn't work with Label Encoder

I’ve found that I cannot use pipelines if I wish to use the label encoder. In the following, I try to build a pipeline that first label-encodes the class strings and then constructs a one-hot encoding from those integer labels.

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.pipeline import make_pipeline
import numpy as np

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])

enc = LabelEncoder()   # strings -> integer labels
hot = OneHotEncoder()  # integer labels -> one-hot columns

pipe = make_pipeline(enc, hot)
pipe.fit_transform(X)

However, the following error is returned:

lib/python2.7/site-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
    117         for name, transform in self.steps[:-1]:
    118             if hasattr(transform, "fit_transform"):
--> 119                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    120             else:
    121                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

TypeError: fit_transform() takes exactly 2 arguments (3 given)

It seems that the problem is that LabelEncoder.fit_transform takes only a y argument, whereas Pipeline assumes every transformer’s fit_transform takes an X and an optional y.
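For comparison, calling the encoder directly (reusing the X and enc defined above) works, precisely because its fit_transform expects a single argument:

# Called directly, LabelEncoder takes one argument and succeeds
enc.fit_transform(X)
# array([0, 2, 1, 0, 1, 2])   ('cat' -> 0, 'cow' -> 1, 'dog' -> 2)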


Most upvoted comments

I found a way around the problem by using CountVectorizer, which can turn those strings into a binary representation directly:

from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer().fit_transform(X).todense()
Out[]: 
matrix([[1, 0, 0],
        [0, 0, 1],
        [0, 1, 0],
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]], dtype=int64)
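Unlike LabelEncoder, CountVectorizer.fit_transform accepts an optional y (its signature is fit_transform(raw_documents, y=None)), so it also drops into a pipeline without hitting the TypeError above. A minimal sketch reusing the X from the original example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# The vectorizer accepts (X, y), so Pipeline can call it normally
pipe = make_pipeline(CountVectorizer())
pipe.fit_transform(X).todense()

One caveat: CountVectorizer tokenizes its input, so multi-word category strings would be split into separate features.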

Hey so, I was wondering: what if we made Pipeline allow y transformations? It would look like this:

Pipeline(features=[Xtransform1],
    labels=[Ytransform1],
    model=clf)

On fit it would do:

clf.fit(
    Xtransform1.fit_transform(X), 
    Ytransform1.fit_transform(y)
)

On predict it would do:

Ytransform1.inverse_transform(
    clf.predict(
        Xtransform1.transform(X)
    )
)

The one big positive I see is that if you save the model, it can be given natural input and return natural output, without anyone having to go back to the documentation. I just ran into this at work, which is why I’m asking.

Would sklearn have any interest in this kind of object?
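For concreteness, here’s a minimal sketch of what such an object could look like; YTransformingPipeline and its argument names are hypothetical, not an existing scikit-learn API:

class YTransformingPipeline:
    """Hypothetical wrapper: transforms X and y on fit, inverts y on predict."""

    def __init__(self, x_transform, y_transform, model):
        self.x_transform = x_transform
        self.y_transform = y_transform
        self.model = model

    def fit(self, X, y):
        Xt = self.x_transform.fit_transform(X)
        yt = self.y_transform.fit_transform(y)
        self.model.fit(Xt, yt)
        return self

    def predict(self, X):
        Xt = self.x_transform.transform(X)
        # map the model's predictions back to the original label space
        return self.y_transform.inverse_transform(self.model.predict(Xt))

Note that LabelEncoder could serve as y_transform here precisely because its fit_transform takes a single argument, which is the same property that breaks it inside Pipeline.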

This can be closed now since the CategoricalEncoder PR is merged (https://github.com/scikit-learn/scikit-learn/pull/9151).
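For anyone landing here later: assuming scikit-learn >= 0.20 (where the CategoricalEncoder functionality was folded into OneHotEncoder and OrdinalEncoder before release), OneHotEncoder accepts string categories directly, so the original example no longer needs LabelEncoder at all. A sketch under that assumption; note the 2-D reshape:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# OneHotEncoder encodes string columns directly; it expects a 2-D array
X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog']).reshape(-1, 1)
OneHotEncoder().fit_transform(X).toarray()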