pandas: Warn on duplicate names in MI?

Opening a new issue so this isn’t lost.

In https://github.com/pandas-dev/pandas/pull/18882 banned duplicate names in a MultiIndex. I think this is a good change since allowing duplicates hit a lot of edge cases when you went to actually do something. I want to make sure we understand all the cases that actually produce duplicate names in the MI though, specifically groupby.apply.

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]:     pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
   ...:                         'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]},
   ...:                        index=[0, 1, 3, 5, 6, 8, 9, 9, 9]).set_index("a")
   ...:
   ...:

In [4]: pdf.groupby(pdf.index).apply(lambda x: x.b)

Another, more realistic example: groupwise drop_duplicates:

In [18]: df = pd.DataFrame({"B": [0, 0, 0, 1, 1, 1, 2, 2, 2]}, index=pd.Index([0, 1, 1, 2, 2, 2, 0, 0, 1], name='a'))

In [19]: df
Out[19]:
   B
a
0  0
1  0
1  0
2  1
2  1
2  1
0  2
0  2
1  2

In [20]: df.groupby('a').apply(pd.DataFrame.drop_duplicates)
Out[20]:
     B
a a
0 0  0
  0  2
1 1  0
  1  2
2 2  1

Is it possible to throw a warning on this for now, in case duplicate names are more common than we thought?

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 18 (18 by maintainers)

Commits related to this issue

Most upvoted comments

Hmm, this was supposed to be done for 0.23, but we missed it.

I still think it’s worthwhile doing for 0.23.1 (cc @guenteru if you have time to make a PR).