pandas: pd.crosstab, categorical data and missing instances
Code Sample, a copy-pastable example if possible
import pandas as pd
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar)
col_0 d e
row_0
a 1 0
b 0 1
c 0 0
Problem description
This is from the documentation:
Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.
However, f is not included in the table while c is.
Please let me know if this is in fact a bug; if so, I will be glad to give writing a patch a try.
Thanks a lot in advance!
Expected Output
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
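As a possible workaround while the behaviour is unclear, the expected table can be produced by reindexing the crosstab result against the declared categories of both inputs. This is only a sketch: the names table and full are illustrative, and it assumes both inputs are pd.Categorical objects with unique categories.

```python
import pandas as pd

foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

# Build the table as usual, whatever subset of categories it ends up showing.
table = pd.crosstab(foo, bar)

# Force every declared category onto both axes; combinations that were
# never observed are filled with a count of zero.
full = table.reindex(
    index=list(foo.categories),
    columns=list(bar.categories),
    fill_value=0,
)
print(full)
```

Because the reindex targets are taken from the categoricals themselves, the result does not depend on which categories crosstab chose to keep.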
Output of pd.show_versions()
pandas: 0.20.1
pytest: 2.8.7
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.12.1
scipy: 0.17.0
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
About this issue
- State: open
- Created 7 years ago
- Comments: 15 (10 by maintainers)
Could a solution to this problem be to change the default of dropna to None instead of True? With dropna=None, the behaviour would then depend on the dtype: False for categorical, True for other dtypes.
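To make the suggestion concrete, here is a rough user-side sketch of how a dropna=None default could be resolved. It is not how pandas implements crosstab; resolve_dropna and crosstab_auto are hypothetical names, and the rule "False if any input is categorical" is one assumed reading of the proposal.

```python
import pandas as pd


def resolve_dropna(dropna, *inputs):
    """Hypothetical helper: choose a dropna value when the caller passes None."""
    if dropna is not None:
        # An explicit True/False from the caller always wins.
        return dropna
    has_categorical = any(
        isinstance(getattr(values, "dtype", None), pd.CategoricalDtype)
        for values in inputs
    )
    # Assumed rule: keep unobserved categories (dropna=False) for
    # categorical inputs, drop empty rows/columns otherwise.
    return not has_categorical


def crosstab_auto(index, columns, dropna=None, **kwargs):
    """Hypothetical wrapper around pd.crosstab with a dtype-dependent default."""
    return pd.crosstab(
        index, columns, dropna=resolve_dropna(dropna, index, columns), **kwargs
    )
```

With this wrapper, crosstab_auto(foo, bar) would call pd.crosstab with dropna=False, while plain lists or object arrays would keep the current dropna=True behaviour.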