pandas: Regression in DataFrame.set_index with class instance column keys

The following code worked in Pandas 0.23.4 but fails in Pandas 0.24.0 (I’m on Python 3.7.2).

import pandas as pd

class Thing:
    # (Production code would also ensure a Thing instance's hash
    # and equality testing depended on name and color)

    def __init__(self, name, color):
        self.name = name
        self.color = color
    
    def __str__(self):
        return "<Thing %r>" % (self.name,)

thing1 = Thing('One', 'red')
thing2 = Thing('Two', 'blue')
df = pd.DataFrame({thing1: [0, 1], thing2: [2, 3]})
df.set_index([thing2])

In Pandas 0.23.4, I get the following correct result:

               <Thing 'One'>
<Thing 'Two'>               
2                          0
3                          1

In Pandas 0.24.0, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib/python3.7/site-packages/pandas/core/frame.py", line 4153, in set_index
    raise ValueError(err_msg)
ValueError: The parameter "keys" may be a column key, one-dimensional array, or a list containing only valid column keys and one-dimensional arrays.

After looking at Pandas 0.24.0’s implementation of DataFrame.set_index (https://github.com/pandas-dev/pandas/blob/83eb2428ceb6257042173582f3f436c2c887aa69/pandas/core/frame.py#L4144-L4153), I noticed that is_scalar returns False for thing1 in Pandas 0.24.0:

>>> from pandas.core.dtypes.common import is_scalar
>>> is_scalar(thing1)
False

I suspect that it is incorrect to test DataFrame column keys using is_scalar.
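To illustrate the distinction, here is a minimal sketch that contrasts is_scalar with a hashability check. It assumes the public helpers pandas.api.types.is_scalar and pandas.api.types.is_hashable; the Thing class below is a stripped-down stand-in for the one in the report, and the variable names are only for illustration.

```python
from pandas.api.types import is_hashable, is_scalar


class Thing:
    """Stand-in for the class in the report; relies on default identity hashing."""

    def __init__(self, name):
        self.name = name


thing = Thing("One")

# is_scalar only recognizes a fixed set of scalar types (numbers, strings, ...),
# so an arbitrary class instance is not treated as a scalar.
scalar_result = is_scalar(thing)

# Any object with a working __hash__ can be used as a dict key, and a DataFrame
# built from a dict accepts it as a column label. A list is not hashable.
hashable_result = is_hashable(thing)
list_result = is_hashable([thing])

print(scalar_result, hashable_result, list_result)
```

A check based on hashability would accept thing as a column key while still rejecting a plain list, which is the behavior the example in this report relies on.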

Output of pd.show_versions()

pd.show_versions() from Pandas 0.23.4

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.1.2
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

pd.show_versions() from Pandas 0.24.0

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 22 (17 by maintainers)

Most upvoted comments

@toobaz is there a reason you closed this? I don’t think the issue is resolved (or was it the typical wrong-button mistake? The GitHub UI should put them farther apart 😃)

Users will need to remember only a simple rule: “hashable -> works”.

Just giving this some more thought today… is this all we should require? There are a lot of index operations that also require the concept of sortability.

Mixed-type indexes have been non-sortable since Python 3: to my eyes the problem is analogous, and not particularly worrisome. In my experience, there are some reshaping ops in which we try to sort and fall back to not sorting, and this is (I think) OK.
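The "try to sort, fall back to not sorting" pattern mentioned above can be sketched in plain Python. This is only an illustration of the idea, not pandas' actual implementation; sort_if_possible is a hypothetical helper name.

```python
def sort_if_possible(labels):
    """Return labels sorted if they are mutually comparable, else in original order."""
    try:
        return sorted(labels)
    except TypeError:
        # Python 3 raises TypeError when ordering unorderable values,
        # e.g. int vs. str, or instances of a class that defines no __lt__.
        return list(labels)


print(sort_if_possible([3, 1, 2]))    # comparable labels get sorted
print(sort_if_possible([2, "a", 1]))  # mixed types: original order is kept
```

Under this approach, hashable-but-unorderable labels (such as instances of a class without ordering methods) behave the same way as mixed-type labels: operations that would like a sorted result simply keep the existing order.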