imbalanced-learn: ValueError: could not convert string to float: 'aaa'
I have imbalanced classes with 10,000 1s and 10 million 0s. I want to undersample before converting the category columns to dummies, to save memory. I expected the sampler to ignore the content of x and select rows at random based on y. However, I get the above error. What am I not understanding, and how do I do this without converting the category features to dummies first?
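import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
clf_sample = RandomUnderSampler(ratio=.025)
x = pd.DataFrame(np.random.random((100, 5)), columns=list("abcde"))
x.loc[:, "b"] = "aaa"
# y is my full target Series (10,000 ones, 10 million zeros)
clf_sample.fit(x, y.head(100))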
About this issue
- State: closed
- Created 8 years ago
- Comments: 22 (11 by maintainers)
Solved in master for RandomUnderSampler and RandomOverSampler. We just have to wait for scikit-learn 0.20 so that we can release 0.4 as well.
Though now all the number columns are converted to strings! An alternative is to just pass the index of my dataframe to the sampler; then select the rows from the result. That should work…unless you can think of a better solution.
Hello!
I’m getting this error with `imblearn` v0.3.3 when trying to use `RandomUnderSampler.fit_sample()` when X includes a column with string values.
The problem is caused by `sklearn.utils.check_X_y` being called in the following form: `check_X_y(X, y, accept_sparse=['csr', 'csc'])`
Since the `dtype` parameter is not specified explicitly, it is set to `"numeric"` by default, as detailed in the function’s documentation here: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/utils/validation.py#L479
As such, the defined behavior of `check_X_y` in this case is: "If "numeric", dtype is preserved unless array.dtype is object."
I’ve cloned your repo and had to add `dtype=None` to the call to `check_X_y` in both `SamplerMixin.sample()` and `BaseSampler.fit()` to get `RandomUnderSampler` to work with string data.
Since prototype selection methods, unlike prototype generation methods, can support any kind of data, I think this check should not be forced for such methods.
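To make the dtype behavior concrete, here is a small sketch that calls `check_X_y` both ways on a toy object array. The array `X` and labels `y` are made up for illustration; only the two `check_X_y` calls come from the discussion above.

```python
import numpy as np
from sklearn.utils import check_X_y

# Toy data: one float column and one string column, stored as object dtype.
X = np.array([[0.1, "aaa"], [0.2, "bbb"], [0.3, "aaa"], [0.4, "bbb"]], dtype=object)
y = np.array([0, 1, 0, 1])

# Default dtype="numeric": an object array is cast to float, which fails on strings.
try:
    check_X_y(X, y, accept_sparse=["csr", "csc"])
    numeric_check_failed = False
except ValueError as exc:
    numeric_check_failed = True
    print("default check raised:", exc)

# dtype=None: the input dtype is kept, so the string column survives validation.
X_checked, y_checked = check_X_y(X, y, accept_sparse=["csr", "csc"], dtype=None)
print(X_checked.dtype)
```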
A possible design is to add a `_check_X_y` method to `SamplerMixin` or `BaseSampler` which will call `sklearn.utils.check_X_y(X, y, accept_sparse=['csr', 'csc'])`, and have prototype selection methods override this method with a version which will instead call `sklearn.utils.check_X_y(X, y, accept_sparse=['csr', 'csc'], dtype=None)`.
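A rough sketch of what that override pattern could look like. The class names `BaseSamplerSketch` and `SelectionSamplerSketch` are hypothetical stand-ins, not imbalanced-learn's actual classes, and `fit` here only stores the validated arrays.

```python
import numpy as np
from sklearn.utils import check_X_y


class BaseSamplerSketch:
    """Hypothetical stand-in for a sampler base class (not imblearn's own)."""

    def _check_X_y(self, X, y):
        # Default validation: numeric data only.
        return check_X_y(X, y, accept_sparse=["csr", "csc"])

    def fit(self, X, y):
        # All validation goes through the overridable hook.
        self.X_, self.y_ = self._check_X_y(X, y)
        return self


class SelectionSamplerSketch(BaseSamplerSketch):
    """Hypothetical prototype selection sampler: it only picks existing rows."""

    def _check_X_y(self, X, y):
        # dtype=None keeps object columns instead of casting them to float.
        return check_X_y(X, y, accept_sparse=["csr", "csc"], dtype=None)


# Toy data with a string column (illustrative only).
X = np.array([[0.1, "aaa"], [0.2, "bbb"], [0.3, "aaa"]], dtype=object)
y = np.array([0, 1, 0])

selection = SelectionSamplerSketch().fit(X, y)
```

With this layout, samplers that generate new points keep the strict numeric check, while samplers that only select rows opt out in one place.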
Whatever the design, if one can be agreed on / you can advise me on one, I don’t mind writing it myself and opening a pull request. That is, assuming you agree with me that non-numeric data should be allowed for prototype selection methods.
Cheers, (and what a great package!) Shay
@simonm3 you could pass the index as you said
@dvro @glemaitre we could implicitly support pandas