scikit-learn: Pipeline doesn't work with Label Encoder
I’ve found that I cannot use the LabelEncoder inside a pipeline. In the following I try to build a pipeline that first encodes the labels and then constructs a one-hot encoding from that labelling.
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.pipeline import make_pipeline
import numpy as np
X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])
enc = LabelEncoder()
hot = OneHotEncoder()
pipe = make_pipeline(enc, hot)
pipe.fit_transform(X)
However, the following error is returned:
lib/python2.7/site-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
    117         for name, transform in self.steps[:-1]:
    118             if hasattr(transform, "fit_transform"):
--> 119                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    120             else:
    121                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

TypeError: fit_transform() takes exactly 2 arguments (3 given)
It seems that the problem is that the fit method for label encoder only takes a y argument, whereas the pipeline assumes that it will take an X and an optional y.
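Given that signature mismatch, one way to sidestep the pipeline entirely is to chain the two encoders by hand: call LabelEncoder on the raw strings (it takes only a single array), then feed its integer output to OneHotEncoder as a 2-D column. A minimal sketch, assuming the standard scikit-learn preprocessing API:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])

# Step 1: LabelEncoder maps each string to an integer code (classes sorted
# alphabetically, so cat=0, cow=1, dog=2).
enc = LabelEncoder()
labels = enc.fit_transform(X)

# Step 2: OneHotEncoder expects a 2-D array, so reshape the codes into a
# single column before one-hot encoding them.
hot = OneHotEncoder()
onehot = hot.fit_transform(labels.reshape(-1, 1))  # sparse (6, 3) matrix
```

This does manually exactly what the pipeline was meant to do, at the cost of two explicit calls instead of one.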
About this issue
- State: closed
- Created 10 years ago
- Reactions: 1
- Comments: 24 (14 by maintainers)
I found a way to work around the problem by using the CountVectorizer, which can turn those strings into a binary representation directly.

Hey so, I was wondering, what if we made pipeline allow y transformations? It would look like this:
On fit it would do:
On predict it would do:
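The elided fit/predict bodies might look something like the following sketch. The name YTransformPipeline and its API are purely illustrative, not actual scikit-learn code: on fit it would encode y before fitting the estimator, and on predict it would inverse-transform the predictions back to natural labels.

```python
class YTransformPipeline:
    """Hypothetical pipeline that also transforms y (illustrative only)."""

    def __init__(self, y_transformer, estimator):
        self.y_transformer = y_transformer  # e.g. a LabelEncoder
        self.estimator = estimator

    def fit(self, X, y):
        # On fit: encode the natural labels, then fit on the encoded ones.
        yt = self.y_transformer.fit_transform(y)
        self.estimator.fit(X, yt)
        return self

    def predict(self, X):
        # On predict: predict encoded labels, then decode back to natural output.
        yt_pred = self.estimator.predict(X)
        return self.y_transformer.inverse_transform(yt_pred)
```

With a LabelEncoder as the y transformer, a saved model of this kind would accept string labels on fit and return string labels on predict.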
The one big positive I see to this is that if you save the model, it could be given natural input and return natural output, without having to go back to the documentation. I just encountered this at work, which is why I’m asking.
Would sklearn have any interest in this kind of object?
This can be closed now since the CategoricalEncoder PR is merged (https://github.com/scikit-learn/scikit-learn/pull/9151).
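For reference, the string-category support from that PR was later folded into OneHotEncoder itself, so the original goal no longer needs a LabelEncoder step at all. A minimal sketch, assuming scikit-learn >= 0.20:

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# OneHotEncoder now accepts string categories directly; it just needs the
# input as a 2-D column rather than a 1-D array.
X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog']).reshape(-1, 1)
onehot = OneHotEncoder().fit_transform(X)  # sparse (6, 3) matrix
```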