category_encoders: HashingEncoder doesn't transform the data

I’m trying to see the output of HashingEncoder. I’ve used the original sample code from the documentation, but I don’t see any difference between the transformed and the non-transformed dataframe. I suspected the columns were being treated as numerical features, so I added my own categorical column (US state codes) and tried to transform it.

Yet again, the output is the same as the input. I’ve even tried using a pipeline, and it seems to run normally (as if the categorical column were converted to numerical ones).

So I’m wondering why the transformed dataframe is not changed. I’m expecting some one-hot-encoded-style columns with binary values. And if the encoder doesn’t work here, why does it work in the pipeline?

*Pipeline code not included.

Here is the code.

from category_encoders.hashing import HashingEncoder
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston

# Boston housing data, as in the documentation example
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# US state codes used to build an artificial categorical column
cardinals = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
             "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
             "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
             "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
             "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

X['cat'] = np.random.choice(cardinals, X.shape[0])

# fit once, then transform
enc = HashingEncoder(cols=['cat']).fit(X, y)
numeric_dataset = enc.transform(X)
print(numeric_dataset.info())
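
For reference (this check is my addition, not part of the documentation example), a quick way to see whether transform() actually did anything is to compare the column sets before and after. With the default n_components=8, HashingEncoder is expected to add col_0 … col_7 and replace the original 'cat' column:

added = set(numeric_dataset.columns) - set(X.columns)
removed = set(X.columns) - set(numeric_dataset.columns)
print("added:", sorted(added))      # expected: ['col_0', ..., 'col_7'] with the default n_components=8
print("removed:", sorted(removed))  # expected: ['cat']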

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 27 (12 by maintainers)

Most upvoted comments

@xKHUNx @bruinAlex @janmotl

The bug is fixed by turning the private __require_data() method into a static require_data() method.

I’ve found an explanation of the issues on Windows from python.org HERE.

Also, as far as I know, Windows implements multiprocessing differently from Linux/Unix (it is less well supported on Windows). This brings another issue: transforming a small dataset with multiprocessing is slower than single-processing (only on Windows).
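
To illustrate the Windows difference (a simplified sketch of my own, not the library’s actual code): on Windows, multiprocessing uses the "spawn" start method, so worker processes re-import the module and the work function must be reachable by name; that is presumably part of what the static require_data() change addresses. A rough sketch of the pattern:

import hashlib
from multiprocessing import Pool

class MiniHasher:
    @staticmethod
    def _hash_chunk(values):
        # A static method is reachable by its qualified name, so spawn-based
        # workers (the default start method on Windows) can find it after
        # re-importing this module. md5 gives a process-independent hash.
        return [int(hashlib.md5(v.encode()).hexdigest(), 16) % 8 for v in values]

    def transform(self, values, processes=2):
        size = max(1, len(values) // processes)
        chunks = [values[i:i + size] for i in range(0, len(values), size)]
        with Pool(len(chunks)) as pool:
            parts = pool.map(MiniHasher._hash_chunk, chunks)
        return [v for part in parts for v in part]

if __name__ == "__main__":  # required on Windows so workers don't re-run this block
    print(MiniHasher().transform(["AL", "AK", "AZ", "CA"], processes=2))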

In my test, the times (in seconds) for different values of max_process are listed below.

(CPU is 2C4T; the dataset contains 24 columns, 1 of which is encoded.)

rows     max_process=1    max_process=2    max_process=4
5k       5.6719           7.2966           10.9216
10k      9.4524           9.6240           11.3289
50k      33.1552          22.1251          19.0312
100k     60.8440          35.9365          28.2346

I suspect that’s because multiprocessing costs more CPU resources than the acceleration it provides. This happens on Linux/Unix too, but only on really small datasets (fewer than 100 rows). Still, I don’t know how to detect the best value of max_process and set it automatically. Maybe a little extra time is acceptable on a small dataset, since multiprocessing can still speed things up on a large one.
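
One possible direction (purely a sketch of my own, not anything the library currently does) would be a size-based heuristic: stay single-process for small inputs and only fan out once the per-row work outweighs the process start-up cost. The threshold below is arbitrary:

import multiprocessing

def pick_max_process(n_rows, small_threshold=10000):
    # Hypothetical heuristic: one process for small data, otherwise
    # use the available logical cores (capped at 4 here arbitrarily).
    if n_rows < small_threshold:
        return 1
    return min(multiprocessing.cpu_count(), 4)

# e.g. pick_max_process(5000) -> 1, pick_max_process(100000) -> up to 4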

I’ll open a new PR after the CI tests pass.

@xKHUNx @bruinAlex @janmotl

Finally, I’ve located the error. It is caused because, in some cases, multiprocessing can’t call the private method __require_data(), but this issue only occurs on Windows and when running outside an IDE.

It will be fixed soon, probably this weekend. Once it’s done I’ll @ you. Sorry about this issue; for now, downgrade category_encoders to 2.0.0 or use max_process=1.
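
For reference, the same max_process=1 workaround also applies when the encoder sits inside a scikit-learn pipeline, as in the original report. A minimal sketch of my own, assuming the X and y from the example code above:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from category_encoders.hashing import HashingEncoder

# max_process=1 keeps the encoder single-process, avoiding the
# multiprocessing code path that triggers the bug on Windows
pipe = Pipeline([
    ("encode", HashingEncoder(cols=["cat"], max_process=1)),
    ("model", LinearRegression()),
])
pipe.fit(X, y)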

I wouldn’t close it. The new version of HashingEncoder in v2.1 has experimental support for parallelism, and this seems to cause the problem. I do not know why, since I can’t replicate it.

Please upgrade back to category_encoders==2.1.0 and rerun your example code, but with a fixed number of processes:

enc = HashingEncoder(cols=['cat'], max_process=1).fit(X, y)

The hypothesis is that when multiple processes are used, their results do not, for whatever reason, get joined correctly. By setting the number of processes to 1, we avoid the joining step altogether.

Also, what does:

import multiprocessing

print(multiprocessing.cpu_count())

return?

Upgraded back to category_encoders==2.1.0

print(multiprocessing.cpu_count()) returns 12, which is expected behaviour as I have a 12-core CPU.

For these cases:

enc = HashingEncoder(cols=['cat']).fit(X, y)
enc = HashingEncoder(cols=['cat'], max_process=2).fit(X, y)
enc = HashingEncoder(cols=['cat'], max_process=3).fit(X, y)

The output is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
cat        506 non-null object
dtypes: float64(13), object(1)
memory usage: 55.5+ KB
None

Only when max_process=1, i.e.:

enc = HashingEncoder(cols=['cat'], max_process=1).fit(X, y)

the output is correct:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
col_0      506 non-null int64
col_1      506 non-null int64
col_2      506 non-null int64
col_3      506 non-null int64
col_4      506 non-null int64
col_5      506 non-null int64
col_6      506 non-null int64
col_7      506 non-null int64
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13), int64(8)
memory usage: 83.1 KB
None

@xKHUNx

Doesn’t verbose=5 produce any more output?

If it doesn’t, then I may have to consider it a random issue, which can only be proved by testing on another computer.

Last weekend I asked several colleagues to test this. We tried Linux (Ubuntu 18.04 LTS & 16.04 LTS), macOS (10.12.1, 10.14.4 & 10.15.1), and Windows (7 Family & 10 Professional) with exactly the same code you provided, and none of us could reproduce the issue.

This issue should stay open for a while, in case anyone else can replicate it. I’m still working on it; @ me anytime.

@janmotl @xKHUNx

I’m sorry this happened. It might be caused by the data concatenation; I’m working on it.