scikit-learn: CategoricalNB bug with categories present in test but absent in train
Description
Calling predict(), predict_proba(), or predict_log_proba() on a fitted CategoricalNB model raises an IndexError when the test data contains a category that was absent from the training data.
Steps/Code to Reproduce
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split
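X, y = make_classification(n_features=10, n_classes=3, n_samples=1000, random_state=42,
                           n_redundant=0, n_informative=6)
# Discretize into non-negative integer category codes.
# The builtin int replaces np.int, which was removed in NumPy 1.24; behavior is the same.
X = np.abs(X.astype(int))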
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
model = CategoricalNB().fit(X_train, y_train)
model.predict(X_test)
Expected Results
Predictions for X_test (integer labels).
Actual Results
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-3-551bb8080923> in <module>
10 model = CategoricalNB().fit(X_train, y_train)
11
---> 12 model.predict(X_test)
~/Documents/MachineLearning/onnx_projects/skl_env/lib/python3.6/site-packages/sklearn/naive_bayes.py in predict(self, X)
75 check_is_fitted(self)
76 X = self._check_X(X)
---> 77 jll = self._joint_log_likelihood(X)
78 return self.classes_[np.argmax(jll, axis=1)]
79
~/Documents/MachineLearning/onnx_projects/skl_env/lib/python3.6/site-packages/sklearn/naive_bayes.py in _joint_log_likelihood(self, X)
1217 for i in range(self.n_features_):
1218 indices = X[:, i]
-> 1219 jll += self.feature_log_prob_[i][:, indices].T
1220 total_ll = jll + self.class_log_prior_
1221 return total_ll
IndexError: index 5 is out of bounds for axis 1 with size 5
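The failing line in the traceback indexes a per-feature log-probability table by the raw category codes in X. The following is a minimal NumPy-only sketch of that mechanism, not the scikit-learn source: `log_prob` is a made-up stand-in for one entry of `feature_log_prob_`, sized for the five category codes (0 to 4) seen during training.

```python
import numpy as np

# Stand-in for one entry of feature_log_prob_: 3 classes x 5 categories
# (codes 0..4 were seen in training). The values are illustrative only.
log_prob = np.log(np.full((3, 5), 0.2))

seen = np.array([0, 4, 2])    # every code occurred in training
unseen = np.array([0, 5, 2])  # code 5 never occurred in training

# Same indexing pattern as the line flagged in the traceback.
ok = log_prob[:, seen].T      # shape (n_samples, n_classes)

try:
    log_prob[:, unseen]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(ok.shape)
print(error_message)
```

Because the table has one column per category seen in training, any larger code at prediction time falls outside axis 1, which is the error reported above.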
Versions
System:
    python: 3.6.8 (v3.6.8:3c6b436a57, Dec 24 2018, 02:04:31) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
    executable: /Users/prroy/Documents/MachineLearning/onnx_projects/skl_env/bin/python3
    machine: Darwin-19.2.0-x86_64-i386-64bit
Python dependencies:
    pip: 18.1
    setuptools: 40.6.2
    sklearn: 0.22.1
    numpy: 1.18.0
    scipy: 1.4.1
    Cython: 0.29.14
    pandas: 0.25.3
    matplotlib: None
    joblib: 0.14.1
Built with OpenMP: True
About this issue
- State: closed
- Created 4 years ago
- Reactions: 5
- Comments: 16 (8 by maintainers)
Some categories are present during testing but never seen during training. We should probably have a strategy to handle unknown categories or at least raise a proper error message.
I would be happy with that solution.
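Both strategies mentioned above (a clear error message, or explicit handling of unknown categories) could look roughly like the sketch below. It assumes scikit-learn 0.24 or later, where `CategoricalNB` accepts a `min_categories` parameter and exposes `n_categories_`; neither exists in the 0.22.1 release reported in this issue. The helper `check_known_categories` is hypothetical and not part of scikit-learn, and the toy arrays are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB


def check_known_categories(model, X):
    """Hypothetical helper: raise a readable error for unseen category codes."""
    X = np.asarray(X)
    for i, n_cat in enumerate(model.n_categories_):
        bad = np.unique(X[X[:, i] >= n_cat, i])
        if bad.size:
            raise ValueError(
                f"Feature {i} has categories {bad.tolist()} but the model "
                f"only knows codes 0..{n_cat - 1}."
            )


X_train = np.array([[0, 1], [1, 0], [2, 1], [1, 1]])
y_train = np.array([0, 1, 0, 1])
X_test = np.array([[3, 0], [0, 1]])  # code 3 in feature 0 is new

# Option 1: fail early with a clear message instead of an IndexError.
plain = CategoricalNB().fit(X_train, y_train)
try:
    check_known_categories(plain, X_test)
    message = None
except ValueError as exc:
    message = str(exc)

# Option 2: reserve room for categories the training split may lack.
# The per-feature counts (4 and 2) are assumed to be known up front.
model = CategoricalNB(min_categories=[4, 2]).fit(X_train, y_train)
pred = model.predict(X_test)

print(message)
print(pred.shape)
```

With `min_categories`, a category unseen in training receives only the smoothed (alpha) probability mass, so predictions for it rest on the remaining features and the class prior. A code above the declared count would still be out of range.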