pandas: Regression in DataFrame.set_index with class instance column keys
The following code worked in Pandas 0.23.4 but not in Pandas 0.24.0 (I’m on Python 3.7.2).
import pandas as pd

class Thing:
    # (Production code would also ensure a Thing instance's hash
    # and equality testing depended on name and color)
    def __init__(self, name, color):
        self.name = name
        self.color = color

    def __str__(self):
        return "<Thing %r>" % (self.name,)

thing1 = Thing('One', 'red')
thing2 = Thing('Two', 'blue')
df = pd.DataFrame({thing1: [0, 1], thing2: [2, 3]})
df.set_index([thing2])
In Pandas 0.23.4, I get the following correct result:
               <Thing 'One'>
<Thing 'Two'>
2                          0
3                          1
In Pandas 0.24.0, I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib/python3.7/site-packages/pandas/core/frame.py", line 4153, in set_index
    raise ValueError(err_msg)
ValueError: The parameter "keys" may be a column key, one-dimensional array, or a list containing only valid column keys and one-dimensional arrays.
After looking at Pandas 0.24.0’s implementation of DataFrame.set_index:
https://github.com/pandas-dev/pandas/blob/83eb2428ceb6257042173582f3f436c2c887aa69/pandas/core/frame.py#L4144-L4153
I noticed that is_scalar returns False for thing1 in Pandas 0.24.0:
>>> from pandas.core.dtypes.common import is_scalar
>>> is_scalar(thing1)
False
I suspect that it is incorrect to test DataFrame column keys using is_scalar.
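To illustrate the suspicion above, here is a minimal sketch comparing is_scalar with a hashability check. It uses the public pandas.api.types.is_scalar and pandas.api.types.is_hashable helpers; suggesting hashability as the more fitting test for a column key is my assumption, not something taken from the pandas source.

```python
from pandas.api.types import is_hashable, is_scalar


class Thing:
    """Same toy class as in the report: a plain, hashable Python object."""

    def __init__(self, name, color):
        self.name = name
        self.color = color

    def __str__(self):
        return "<Thing %r>" % (self.name,)


thing = Thing('One', 'red')

# An arbitrary class instance is not one of the scalar types pandas recognizes...
thing_is_scalar = is_scalar(thing)
# ...but it is hashable, which is all a dict key (and so a column label) needs.
thing_is_hashable = is_hashable(thing)

print(thing_is_scalar, thing_is_hashable)
```

If column labels only have to be hashable, a check along the lines of is_hashable would accept Thing instances where is_scalar rejects them.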
Output of pd.show_versions()
pd.show_versions() from Pandas 0.23.4
INSTALLED VERSIONS
commit: None python: 3.7.2.final.0 python-bits: 64 OS: Darwin OS-release: 17.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.23.4 pytest: None pip: 18.1 setuptools: 40.4.3 Cython: None numpy: 1.16.0 scipy: 1.1.0 pyarrow: None xarray: None IPython: 7.2.0 sphinx: None patsy: None dateutil: 2.7.3 pytz: 2018.5 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: 1.1.2 lxml: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
pd.show_versions() from Pandas 0.24.0
INSTALLED VERSIONS
commit: None python: 3.7.2.final.0 python-bits: 64 OS: Darwin OS-release: 17.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.24.0 pytest: None pip: 18.1 setuptools: 40.4.3 Cython: None numpy: 1.16.0 scipy: 1.1.0 pyarrow: None xarray: None IPython: 7.2.0 sphinx: None patsy: None dateutil: 2.7.3 pytz: 2018.5 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: 1.1.2 lxml.etree: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None
About this issue
- State: closed
- Created 5 years ago
- Comments: 22 (17 by maintainers)
@toobaz is there a reason you closed this? I don’t think the issue is resolved (or was it the typical wrong-button mistake? The GitHub UI should put them farther apart 😃)
Mixed-type indexes have been non-sortable since Python 3. To my eyes the problem is analogous, and not particularly worrisome. In my experience, there are some reshaping ops in which we try to sort and fall back to not sorting, and this is (I think) OK.
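A rough sketch of the "try to sort, fall back to not sorting" pattern in plain Python. The helper name sort_if_possible is hypothetical and only illustrates the idea; it is not how pandas implements this internally.

```python
def sort_if_possible(labels):
    """Return the labels sorted if they are mutually comparable, else in original order."""
    try:
        return sorted(labels)
    except TypeError:
        # Python 3 refuses to order unrelated types (e.g. str vs. int),
        # so keep the original order instead of failing.
        return list(labels)


print(sort_if_possible([3, 1, 2]))      # [1, 2, 3]
print(sort_if_possible(["b", 1, "a"]))  # ['b', 1, 'a']
```

Labels built from a custom class without ordering methods would take the same fallback path as the mixed str/int list.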