category_encoders: HashingEncoder doesn't transform the data
I’m trying to see the output of HashingEncoder. I started from the original sample code in the documentation and saw no difference between the transformed and non-transformed dataframes. I suspected the columns were being treated as numerical features, so I added my own categorical column (US state codes) and tried to transform that.
Yet again, the output is the same as the input. I’ve even tried using a pipeline, and it seems to run normally (as if the categorical column were converted to numerical ones).
So I’m wondering: why is the transformed dataframe unchanged? I’m expecting some one-hot-encoded columns with binary values. And if the encoder doesn’t work on its own, why does it work in the pipeline?
*Pipeline code not included.
Here is the code.
from category_encoders.hashing import HashingEncoder
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
cardinals = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
X['cat'] = np.random.choice(cardinals, X.shape[0])
enc = HashingEncoder(cols=['cat'])
enc.fit(X, y)
numeric_dataset = enc.transform(X)
print(numeric_dataset.info())
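For comparison, here is a minimal, self-contained sketch of the hashing trick itself — an illustration only, not category_encoders' exact implementation: each category value is hashed (md5 here, which is the library's default `hash_method`), and the hash picks one of `n_components` indicator columns. A working HashingEncoder should replace the `cat` column with similarly named `col_0 … col_7` columns.

```python
import hashlib

import numpy as np
import pandas as pd

def hash_column(values, n_components=8):
    """Toy version of the hashing trick: map each category to one of
    n_components indicator columns via a hash of its string value."""
    out = np.zeros((len(values), n_components), dtype=int)
    for i, v in enumerate(values):
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        out[i, h % n_components] = 1
    return pd.DataFrame(out, columns=[f"col_{j}" for j in range(n_components)])

encoded = hash_column(["CA", "NY", "CA", "TX"])
print(encoded.shape)                  # (4, 8)
print(encoded.sum(axis=1).tolist())  # [1, 1, 1, 1] -- one indicator per row
```

Note that identical categories always hash to the same column, so the two "CA" rows come out identical.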
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 27 (12 by maintainers)
@xKHUNx @bruinAlex @janmotl
Bug fixed by turning the private method `__require_data()` into a static `require_data()`.
I’ve found an explanation of the Windows issues on python.org HERE.
Also, as far as I know, Windows implements multiprocessing differently from Linux/Unix (it is less well supported on Windows). This brings another issue: transforming a small dataset with multiprocessing is slower than single-processing (only on Windows).
In my test, the times (in seconds) for different `max_process` values are listed below (the CPU is 2C4T; the dataset contains 24 columns, 1 of which is encoded):
I suspect that’s because multiprocessing costs more CPU resources than the acceleration it provides. This happens on Linux/Unix too, but only with really small datasets (fewer than 100 rows). Still, I don’t know how to detect the best value of `max_process` and set it automatically. Maybe a little extra time is acceptable on a small dataset, since multiprocessing can still speed things up on a large one.
I’ll open a new PR after the CI tests.
@xKHUNx @bruinAlex @janmotl
Finally, I’ve located the error. It is caused because, in some cases, multiprocessing can’t call the private method `__require_data()`; the issue only occurs on Windows and when running outside an IDE. It should be fixed soon, possibly this weekend. Once done I’ll @ you. Sorry about this issue; for now, downgrade category_encoders to 2.0.0 or use `max_process=1`.
I wouldn’t close it. The new version of HashingEncoder in v2.1 has experimental support for parallelism, and this seems to cause the problems. And I do not know why, since I can’t replicate it.
Please upgrade back to `category_encoders==2.1.0` and rerun your example code, but with a fixed count of processes. The hypothesis is that when multiple processes are used, their results don’t get joined correctly, for whatever reason. By setting the process count to 1, we avoid the joining.
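A toy sketch of the kind of joining meant here (hypothetical, not the encoder's actual code): rows are split into chunks, each chunk is transformed independently, and the pieces must be concatenated back in the original row order. If that last step misfires, the output no longer lines up with the input.

```python
import pandas as pd

# Hypothetical chunked transform: split rows, encode each chunk, re-join.
df = pd.DataFrame({"cat": list("abcabc")})

# Split into chunks of 2 rows each (the index is preserved by iloc).
chunks = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]

# "Transform" each chunk: map each category to a numeric code.
transformed = [
    c.assign(code=c["cat"].map({"a": 0, "b": 1, "c": 2})) for c in chunks
]

# Correct join: concatenate and restore the original row order via the index.
joined = pd.concat(transformed).sort_index()
print(joined["code"].tolist())  # [0, 1, 2, 0, 1, 2]
```

Dropping the `sort_index()` (or resetting indices per chunk) is the sort of mistake that would silently scramble or duplicate rows after a parallel transform.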
Also, what does `print(multiprocessing.cpu_count())` return?
Upgraded back to `category_encoders==2.1.0`.
`print(multiprocessing.cpu_count())` returns 12, which is expected behaviour, as I have a 12-core CPU.
For these cases:
enc = HashingEncoder(cols=['cat']).fit(X, y)
enc = HashingEncoder(cols=['cat'], max_process=2).fit(X, y)
enc = HashingEncoder(cols=['cat'], max_process=3).fit(X, y)
The output is:
Only when `max_process=1`, i.e.:
enc = HashingEncoder(cols=['cat'], max_process=1).fit(X, y)
is the output correct:
@xKHUNx
Doesn’t `verbose=5` produce any more output? If it doesn’t, I may consider this a random issue, one that can only be proved by testing on another computer.
Last weekend several of my colleagues were asked to test this. We tried Linux (Ubuntu 18.04 LTS & 16.04 LTS), macOS (10.12.1, 10.14.4 & 10.15.1), and Windows (7 Home & 10 Professional), with exactly the same code you provided, and none of us hit this issue.
This issue should stay open for a while, in case anyone else can replicate it. I’m still working on it; @ me anytime you want.
@janmotl @xKHUNx
I’m sorry this happened. It might be caused by the data concat; I’m working on it.