scikit-learn: Categorical Naive Bayes not available

Right now, scikit-learn provides several Naive Bayes models:

  • GaussianNB: For continuous features that are assumed to be Gaussian distributed.
  • MultinomialNB: For discrete features that are multinomially distributed, e.g. counts of word occurrences.
  • BernoulliNB: For indicator features (True/False) that are assumed to be Bernoulli distributed.
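The three variants above differ only in the assumed per-feature distribution; a minimal sketch (with made-up toy data) of how each is used:

```python
# Sketch: the three existing scikit-learn Naive Bayes variants,
# each fit on toy data matching its distributional assumption.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_cont = np.array([[1.2, 0.3], [0.9, 0.5], [3.1, 2.2], [2.8, 1.9]])  # continuous
X_counts = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 2], [0, 2, 3]])    # word counts
X_bool = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])                  # indicators
y = np.array([0, 0, 1, 1])

print(GaussianNB().fit(X_cont, y).predict(X_cont))      # continuous features
print(MultinomialNB().fit(X_counts, y).predict(X_counts))  # count features
print(BernoulliNB().fit(X_bool, y).predict(X_bool))     # indicator features
```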

The obvious thing that is missing is a variant for categorical features, such as color. It is of course possible to use dummy encoding to transform a categorical feature into one indicator feature per category, but this breaks the categorical correlation: if a car is red, it obviously isn’t green or yellow.
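For concreteness, the dummy-encoding workaround described above can be sketched as follows; the color values and labels are made up for illustration, and BernoulliNB then treats the resulting indicator columns as independent, which is exactly the loss of categorical correlation being criticized:

```python
# Sketch: one-hot encode a categorical feature and feed the resulting
# indicator columns to BernoulliNB (the workaround, not a real CategoricalNB).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import BernoulliNB

# Toy data: a single categorical feature (car color) and a binary label.
X = np.array([["red"], ["green"], ["yellow"], ["red"], ["green"]])
y = np.array([1, 0, 0, 1, 0])

clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"), BernoulliNB())
clf.fit(X, y)
print(clf.predict([["red"]]))  # predicts the class seen with "red"
```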

So, long story short: are there any plans to add a CategoricalNB? Would you like to see a PR? Or am I missing something obvious here?

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 7
  • Comments: 18 (15 by maintainers)

Most upvoted comments

Hi, is anybody already working on this issue? If not, I would like to start working on it. Right now it seems to me that both CategoricalNB and GeneralNB are relevant, as I am going to discuss in the following.

There are many questions regarding categorical features with tree-based models (8480, 11258, …), and there is also the big NOCATS PR that is still being merged. Going through the issues, there does seem to be less demand for categorical-feature support in NB than for tree-based models. However, I know colleagues of mine who use categorical features with NB in their work and who would benefit from such an implementation. Additionally, there is the already mentioned Stack Overflow question, and it feels natural to me to extend NB with categorical features as well, to keep it in step with the enhancements to other models (e.g. tree-based ones).

Concerning GeneralNB: new features such as ColumnTransformer and requests such as 11379 indicate that there is, generally speaking, a demand for models that take input vectors with mixed feature types. Also, the quite popular Stack Overflow question “Mixing categorial and continuous data in Naive Bayes classifier using scikit-learn” goes directly to the essence of a GeneralNB. By reasoning similar to that for CategoricalNB, GeneralNB therefore seems relevant to me: there is direct demand for it, and keeping NB aligned with the general direction of scikit-learn’s development seems like a good idea.

Overall, I think an implementation would go well with the first part of the tenet “Simple things should be simple, complex things should be possible.” that is often cited in the context of scikit-learn. I would therefore propose to implement both, starting with CategoricalNB. @jnothman already mentioned that he is more interested in GeneralNB. What are your thoughts?

@remykarem Why don’t you extract the Mixed part and provide a PR for Scikit-Learn? As you can read above, the maintainers are looking for someone to provide a GeneralNB. It would be great if you could help out.

It would, I think, also be instructive to show that such an approach is valid, and even recommended, when features come from different distributions. And it would imply that users can go elsewhere for distributions we do not provide. I’d be keen to see a GeneralNB.

@jnothman, from a user’s perspective this would make things less complicated and also make it clearer what to do in the case of different feature distributions. A GeneralNB could be implemented as a meta-estimator that takes a list of tuples specifying the actual Naive Bayes classifier and the columns to apply it to, e.g. [(GaussianNB(), [0, 1, 2, 3]), (BernoulliNB(), [4, 5]), (CategoricalNB(alpha=0.2), [6, 7])].
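A minimal sketch of such a meta-estimator, under the naive independence assumption that lets the per-block log-likelihoods simply be summed. The name GeneralNB and its API are the proposal from this thread, not an existing scikit-learn class; this is one possible shape, not a definitive implementation:

```python
# Hypothetical GeneralNB meta-estimator: fits one Naive Bayes sub-estimator
# per column group and combines their per-class log-probabilities.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.naive_bayes import GaussianNB, BernoulliNB

class GeneralNB(ClassifierMixin, BaseEstimator):
    def __init__(self, estimators):
        # estimators: list of (naive_bayes_estimator, column_indices) tuples
        self.estimators = estimators

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.estimators_ = [
            (clone(est).fit(X[:, cols], y), cols)
            for est, cols in self.estimators
        ]
        # Empirical class log-priors, shared by all sub-estimators (their default).
        self.class_log_prior_ = np.log(
            np.bincount(np.searchsorted(self.classes_, y)) / len(y)
        )
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Each sub-estimator's predict_log_proba includes the class prior, so
        # summing them counts the prior once per block; subtract the extra
        # copies to keep a single prior in the (unnormalized) joint score.
        joint = sum(
            est.predict_log_proba(X[:, cols]) for est, cols in self.estimators_
        )
        joint = joint - (len(self.estimators_) - 1) * self.class_log_prior_
        return self.classes_[np.argmax(joint, axis=1)]
```

Usage would then follow the tuple-list API proposed above, e.g. `GeneralNB([(GaussianNB(), [0, 1]), (BernoulliNB(), [2])])` for two continuous columns and one indicator column.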