scikit-learn: CategoricalNB bug with categories present in test but absent in train

Description

Calling predict() / predict_proba() / predict_log_proba() on a fitted CategoricalNB model raises an IndexError when the test data contains a category that was not present in the training data.

Steps/Code to Reproduce

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

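X, y = make_classification(n_features=10, n_classes=3, n_samples=1000, random_state=42,
                           n_redundant=0, n_informative=6)
# Truncate to non-negative integers so each feature behaves like a category index.
# The built-in int replaces the np.int alias, which newer NumPy releases no longer provide.
X = np.abs(X.astype(int))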
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
model = CategoricalNB().fit(X_train, y_train)

model.predict(X_test)

Expected Results

Predictions for X_test (integer labels).

Actual Results

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-551bb8080923> in <module>
     10 model = CategoricalNB().fit(X_train, y_train)
     11 
---> 12 model.predict(X_test)

~/Documents/MachineLearning/onnx_projects/skl_env/lib/python3.6/site-packages/sklearn/naive_bayes.py in predict(self, X)
     75         check_is_fitted(self)
     76         X = self._check_X(X)
---> 77         jll = self._joint_log_likelihood(X)
     78         return self.classes_[np.argmax(jll, axis=1)]
     79 

~/Documents/MachineLearning/onnx_projects/skl_env/lib/python3.6/site-packages/sklearn/naive_bayes.py in _joint_log_likelihood(self, X)
   1217         for i in range(self.n_features_):
   1218             indices = X[:, i]
-> 1219             jll += self.feature_log_prob_[i][:, indices].T
   1220         total_ll = jll + self.class_log_prior_
   1221         return total_ll

IndexError: index 5 is out of bounds for axis 1 with size 5
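
The failing line indexes a per-feature table of log-probabilities by the raw category values in X. A minimal NumPy-only sketch of the same failure follows; the `log_prob` array is a hypothetical stand-in for one entry of `feature_log_prob_`, assumed to have one column per category seen during training:

```python
import numpy as np

# Hypothetical stand-in for feature_log_prob_[i]:
# 3 classes x 5 categories (values 0..4 were seen during training).
log_prob = np.log(np.full((3, 5), 0.2))

# Category values that all appeared in training index the table without trouble.
seen = np.array([0, 4, 2])
contribution = log_prob[:, seen].T  # one row per sample, one column per class

# A value of 5 never appeared in training, so column 5 does not exist.
unseen = np.array([0, 5])
try:
    log_prob[:, unseen]
    message = None
except IndexError as exc:
    message = str(exc)

print(message)
```

Under that assumption, any test value greater than or equal to the number of training categories for a feature triggers the error.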

Versions

System:
    python: 3.6.8 (v3.6.8:3c6b436a57, Dec 24 2018, 02:04:31) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
    executable: /Users/prroy/Documents/MachineLearning/onnx_projects/skl_env/bin/python3
    machine: Darwin-19.2.0-x86_64-i386-64bit

Python dependencies:
    pip: 18.1
    setuptools: 40.6.2
    sklearn: 0.22.1
    numpy: 1.18.0
    scipy: 1.4.1
    Cython: 0.29.14
    pandas: 0.25.3
    matplotlib: None
    joblib: 0.14.1

Built with OpenMP: True

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 16 (8 by maintainers)

Most upvoted comments

Some categories are present during testing but never seen during training. We should probably have a strategy to handle unknown categories or at least raise a proper error message.

I would be happy with that solution.
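
The "proper error message" option mentioned in the comments could look roughly like the sketch below. It is not scikit-learn's actual fix. `check_known_categories` and its `n_categories` argument are hypothetical names, and the sketch assumes categories are encoded as non-negative integers with the per-feature category count known from training:

```python
import numpy as np


def check_known_categories(X, n_categories):
    """Raise ValueError if X contains a category index unseen in training.

    n_categories[i] is the number of categories learned for feature i
    (hypothetical argument; assumes X has at least one row).
    """
    X = np.asarray(X)
    n_categories = np.asarray(n_categories)
    if X.ndim != 2 or X.shape[1] != n_categories.shape[0]:
        raise ValueError("X must be 2-D with one column per fitted feature")
    # A feature is invalid when its largest value has no column in the table.
    bad = np.flatnonzero(X.max(axis=0) >= n_categories)
    if bad.size:
        raise ValueError(
            "Unknown categories in feature(s) %s: values must be smaller "
            "than %s" % (bad.tolist(), n_categories[bad].tolist())
        )
    return X


X_train = np.array([[0, 1], [2, 0], [1, 1]])
X_test = np.array([[0, 1], [3, 0]])  # 3 never appears in feature 0 of X_train

# Assumption: the category count per feature is the training maximum plus one.
n_categories = X_train.max(axis=0) + 1

try:
    check_known_categories(X_test, n_categories)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

With a fitted model, the counts could presumably be read from the second dimension of each array in `feature_log_prob_`, since the traceback shows those arrays being indexed by category along axis 1.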