pandas: BUG: Cannot convert existing column to categorical

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B"], categories=["A", "B"]),
    "B": [1, 2],
    "C": ["D", "E"],
})
print(x.dtypes)
x.loc[:, "C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
print(x.dtypes)

Issue Description

When an existing column is assigned its categorical equivalent through .loc, the column's dtype does not change: it stays object instead of becoming category.

Expected Behavior

Output now:

A    category
B       int64
C      object
dtype: object
A    category
B       int64
C      object
dtype: object

Expected output:

A    category
B       int64
C      object
dtype: object
A    category
B       int64
C    category   <----
dtype: object
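A possible workaround, sketched here rather than taken from the thread, is to avoid the .loc assignment and convert the column with DataFrame.astype, passing a dict that maps the column name to the target dtype. This assumes that rebinding x to the returned frame is acceptable, because astype returns a new DataFrame instead of modifying the original.

```python
import pandas as pd

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B"], categories=["A", "B"]),
    "B": [1, 2],
    "C": ["D", "E"],
})

# astype returns a new DataFrame, so the result is rebound to x.
# Only the column named in the dict is converted; the others are left alone.
x = x.astype({"C": pd.CategoricalDtype(categories=["D", "E"])})
print(x.dtypes)
```

Plain column assignment, x["C"] = pd.Categorical(...), is the other route, since it replaces the column instead of writing into the existing one.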

Installed Versions

INSTALLED VERSIONS

commit           : 478d340667831908b5b4bf09a2787a11a14560c9
python           : 3.9.14.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.4.0
Version          : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 2.0.0
numpy            : 1.23.2
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 67.4.0
pip              : 23.0.1
Cython           : 0.29.33
pytest           : 7.2.2
hypothesis       : None
sphinx           : 6.1.3
blosc            : None
feather          : None
xlsxwriter       : 3.0.8
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.9.5
jinja2           : 3.0.3
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : 2023.1.0
gcsfs            : None
matplotlib       : 3.7.0
numba            : 0.56.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 11.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.8.1
snappy           : None
sqlalchemy       : 1.4.46
tables           : None
tabulate         : 0.9.0
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2022.7
qtpy             : None
pyqt5            : None

About this issue

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 18 (12 by maintainers)

Most upvoted comments

Still seems to fail for me

>>> x = pd.DataFrame({
...     "A": pd.Categorical(["A", "B"], categories=["A", "B"]),
...     "B": [1,2],
...     "C": ["D", "E"]
... })
>>> print(x.dtypes)
A    category
B       int64
C      object
dtype: object
>>> x.loc[:, "C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
>>> print(x.dtypes)
A    category
B       int64
C      object
dtype: object

C still seems to be object.

I wanted to convert the dtype (float -> int) of multiple columns in my df and ran into this issue:

# This doesn't work
df.iloc[:, 2:] = df.iloc[:, 2:].astype(int)

# Neither does this
df.loc[:, df.columns[2:]] = df.loc[:, df.columns[2:]].astype(int)

# This works
for col in df.columns[2:]:
    df[col] = df[col].astype(int)

pandas 2.1.3
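For the positional case, one option that keeps the iloc-style column positions is DataFrame.isetitem, which is documented as always inserting a new array rather than trying to write into the existing one. The sketch below is an illustration under assumptions: the frame and its column names are made up, and the float columns are assumed to hold whole numbers with no missing values, so the cast to int64 cannot fail.

```python
import pandas as pd

# Hypothetical frame: two leading columns, then float columns to convert.
df = pd.DataFrame({
    "id": ["a", "b"],
    "label": ["x", "y"],
    "m1": [1.0, 2.0],
    "m2": [3.0, 4.0],
})

# Replace each column from position 2 onward with its int64 version.
# isetitem swaps in the new array, so the column dtype changes.
for pos in range(2, df.shape[1]):
    df.isetitem(pos, df.iloc[:, pos].astype("int64"))

print(df.dtypes)
```

Looping one position at a time is a deliberate choice here: it avoids relying on how isetitem handles a list of positions together with a DataFrame value.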

I have the same issue:

import pandas as pd

d = pd.DataFrame(dict(wanna_be_cat=['A', 'B', 'B']))

d.loc[:, 'wanna_be_cat'] = d['wanna_be_cat'].astype('category')

# 'wanna_be_cat' column not modified to become categorical
print(d['wanna_be_cat'].dtype)

d['wanna_be_cat'] = d['wanna_be_cat'].astype('category')

# 'wanna_be_cat' column modified to become categorical
print(d['wanna_be_cat'].dtype)

The fact that it works without .loc but doesn't with .loc is very confusing.