imbalanced-learn: ValueError: could not convert string to float: 'aaa'

I have imbalanced classes with 10,000 1s and 10 million 0s. I want to undersample before I convert the categorical columns to dummies, to save memory. I expected the sampler to ignore the content of x and select rows randomly based on y. However, I get the above error. What am I not understanding, and how do I do this without converting the categorical features to dummies first?

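import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

clf_sample = RandomUnderSampler(ratio=.025)
x = pd.DataFrame(np.random.random((100, 5)), columns=list("abcde"))
x.loc[:, "b"] = "aaa"  # one column now holds strings
clf_sample.fit(x, y.head(100))  # y is the binary target Series (not shown here)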

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 22 (11 by maintainers)

Most upvoted comments

Solved in master for RandomUnderSampler and RandomOverSampler. We just have to wait for scikit-learn 0.20 so that we can release 0.4 as well.

Though now all the numeric columns are converted to strings! An alternative is to pass just the index of my dataframe to the sampler, then select the rows from the result. That should work…unless you can think of a better solution.
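The index-based idea can also be sketched without any sampler at all: choose row labels from y alone, then select those rows from the dataframe, so every column keeps its original dtype. This is only a minimal illustration; the helper name undersample_by_label, its n_per_class parameter, and the toy data are assumptions made for the example, not part of imbalanced-learn.

```python
import numpy as np
import pandas as pd


def undersample_by_label(X, y, n_per_class, seed=0):
    """Hypothetical helper: pick row labels using y only, then slice X."""
    rng = np.random.default_rng(seed)
    keep = []
    for _, idx in y.groupby(y).groups.items():
        # Never request more rows than the class actually has.
        size = min(n_per_class, len(idx))
        keep.extend(rng.choice(idx.to_numpy(), size=size, replace=False))
    keep = pd.Index(keep)
    return X.loc[keep], y.loc[keep]


# Toy data (assumed): one numeric column, one string column, 10 ones and 90 zeros.
X = pd.DataFrame(
    {"a": np.arange(100, dtype=float), "b": ["aaa"] * 100},
    index=range(1000, 1100),
)
y = pd.Series([1] * 10 + [0] * 90, index=X.index)

X_res, y_res = undersample_by_label(X, y, n_per_class=10)
print(X_res.dtypes)
print(y_res.value_counts())
```

Because the content of X is never inspected, string and categorical columns pass through untouched and the numeric column stays numeric.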

Hello!

I’m getting this error with imblearn v0.3.3 when trying to use RandomUnderSampler.fit_sample() when X includes a column with string values.

The problem is caused by sklearn.utils.check_X_y being called in the following form: check_X_y(X, y, accept_sparse=['csr', 'csc']). Since the dtype parameter is not specified explicitly, it defaults to "numeric", as detailed in the function’s documentation here: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/utils/validation.py#L479

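As such, the defined behavior of check_X_y in this case is: "If "numeric", dtype is preserved unless array.dtype is object." An object array, such as one holding strings, is therefore converted to float, and that conversion is what raises this error.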
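The float conversion itself can be reproduced with NumPy alone. The snippet below is a rough illustration of what a "numeric" check attempts on an object array, not scikit-learn's actual code path; the variable names are made up for the example.

```python
import numpy as np

# Object-dtype data holding strings, like column "b" in the original report.
strings = np.array(["aaa", "aaa"], dtype=object)
# Object-dtype data that happens to hold numbers converts without trouble.
numbers = np.array([1.5, 2], dtype=object)

converted = numbers.astype(np.float64)

try:
    strings.astype(np.float64)
    message = None
except ValueError as exc:
    message = str(exc)

print(converted.dtype)
print(message)
```

The message captured here should be the same "could not convert string to float" text as in the issue title.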

I’ve cloned your repo and had to add dtype=None to the call to check_X_y in both SamplerMixin.sample() and BaseSampler.fit() to get RandomUnderSampler to work with string data.

Since prototype selection methods, unlike prototype generation methods, can support any kind of data, I think this check should not be forced for such methods.

A possible design is to add a _check_X_y method to SamplerMixin or BaseSampler which calls sklearn.utils.check_X_y(X, y, accept_sparse=['csr', 'csc']), and have prototype selection methods override it with a version that instead calls sklearn.utils.check_X_y(X, y, accept_sparse=['csr', 'csc'], dtype=None).
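A rough sketch of that design, assuming scikit-learn is installed. The classes BaseSamplerSketch and SelectionSamplerSketch are hypothetical stand-ins written for this illustration and are not imbalanced-learn's real classes; only check_X_y and its accept_sparse and dtype parameters come from scikit-learn.

```python
import numpy as np
from sklearn.utils.validation import check_X_y


class BaseSamplerSketch:
    """Hypothetical stand-in for a base sampler; not imbalanced-learn's code."""

    def _check_X_y(self, X, y):
        # Default validation: dtype="numeric" tries to cast object arrays to float.
        return check_X_y(X, y, accept_sparse=["csr", "csc"])

    def fit(self, X, y):
        self.X_, self.y_ = self._check_X_y(X, y)
        return self


class SelectionSamplerSketch(BaseSamplerSketch):
    """Hypothetical prototype-selection sampler: it only picks existing rows."""

    def _check_X_y(self, X, y):
        # dtype=None keeps the input dtype, so string columns pass through.
        return check_X_y(X, y, accept_sparse=["csr", "csc"], dtype=None)


X = np.array(
    [["aaa", "x"], ["bbb", "y"], ["aaa", "z"], ["ccc", "x"]], dtype=object
)
y = np.array([0, 1, 1, 1])

# The overriding sampler accepts the string data.
selection = SelectionSamplerSketch().fit(X, y)

# The default validation rejects it.
try:
    BaseSamplerSketch().fit(X, y)
    base_error = None
except ValueError as exc:
    base_error = exc

print(selection.X_.dtype)
print(type(base_error).__name__)
```

Keeping the validation in a single overridable method means samplers that synthesize new points can keep the strict numeric check, while samplers that only select existing rows can opt out of it.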

Whatever the design, if one can be agreed on / you can advise me on one, I don’t mind writing it myself and opening a pull request. That is, assuming you agree with me that non-numeric data should be allowed for prototype selection methods.

Cheers, (and what a great package!) Shay

@simonm3 you could pass the index, as you said:

import numpy as np
import pandas as pd

from imblearn.under_sampling import RandomUnderSampler

# create test data
X = np.array([['aaa'] * 100, ['bbb'] * 100]).T
X_df = pd.DataFrame(X, columns=list("ab"), index=range(1000, 1100))
y = np.array([0] * 10 + [1] * 90)

# numpy test
X_res1, y_res1 = RandomUnderSampler().fit_sample(X, y)

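# pandas test: sample on the index values only, then select the kept rows
X_i = X_df.index.values.reshape(-1, 1)  # index as a single numeric column
_, _, i = RandomUnderSampler(return_indices=True).fit_sample(X_i, y)
X_res2, y_res2 = X_df.iloc[i, :], y[i]  # i holds positions of the kept rows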

@dvro @glemaitre we could implicitly support pandas