pandas: pd.crosstab, categorical data and missing instances

Code Sample, a copy-pastable example if possible

import pandas as pd
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar)

col_0  d  e
row_0      
a      1  0
b      0  1
c      0  0

Problem description

This is from the documentation:

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

However, the unused category f is missing from the columns of the table, while the unused category c is included in the rows.

Please let me know if this is in fact a bug; if so, I will be glad to give writing a patch a try.

Thanks a lot in advance!

Expected Output

col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0
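For comparison, here is a minimal sketch of how the expected table can be requested explicitly. It assumes a recent pandas release in which passing dropna=False to pd.crosstab keeps every declared category on both axes, as the pandas documentation example for crosstab shows; the behavior of the default call differs between versions, so it is not asserted here.

```python
import pandas as pd

# Same inputs as in the report: each Categorical declares one unused category.
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

# dropna=False asks crosstab to keep rows/columns that contain no observations,
# so the unused categories 'c' and 'f' both appear with zero counts.
full = pd.crosstab(foo, bar, dropna=False)
print(full)
```

Under that assumption, the result has rows a, b, c and columns d, e, f, which matches the expected output above.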

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-49-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 2.8.7
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.12.1
scipy: 0.17.0
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

About this issue

State: open
Created 7 years ago
Comments: 15 (10 by maintainers)

Most upvoted comments

Could a solution to this problem be to change the default of dropna from True to None? With dropna=None, the effective value would then depend on the dtype: False for categorical, True for other dtypes.

jorisvandenbossche on Sep 1, 2018
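The suggestion above could be sketched in user code roughly as follows. This is only an illustration, not the pandas implementation: crosstab_auto_dropna and _is_categorical are hypothetical helper names, and treating a call as "categorical" when either input is categorical is an assumption, since the comment does not say how mixed inputs should be handled.

```python
import pandas as pd


def _is_categorical(values):
    # Hypothetical helper: True if the input carries a categorical dtype.
    return isinstance(getattr(values, "dtype", None), pd.CategoricalDtype)


def crosstab_auto_dropna(index, columns, dropna=None, **kwargs):
    # Hypothetical wrapper around pd.crosstab emulating a dropna=None default.
    if dropna is None:
        # Assumption: keep empty categories if either input is categorical,
        # otherwise fall back to the current default of dropna=True.
        dropna = not (_is_categorical(index) or _is_categorical(columns))
    return pd.crosstab(index, columns, dropna=dropna, **kwargs)


foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

# Categorical inputs resolve to dropna=False, so 'c' and 'f' are kept.
full = crosstab_auto_dropna(foo, bar)

# Non-categorical inputs resolve to dropna=True, as today.
plain = crosstab_auto_dropna(pd.Series(['a', 'b']), pd.Series(['d', 'e']))

print(full)
print(plain)
```

An explicit dropna=True or dropna=False passed by the caller is forwarded unchanged, so existing calls would keep their current behavior under this sketch.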