pandas: Warn on duplicate names in MI?
Opening a new issue so this isn’t lost.
https://github.com/pandas-dev/pandas/pull/18882 banned duplicate names in a MultiIndex. I think this is a good change, since allowing duplicates hit a lot of edge cases as soon as you tried to actually do something with the index. I want to make sure we understand all the cases that actually produce duplicate names in the MultiIndex though, specifically groupby.apply.
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
   ...:                     'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]},
   ...:                    index=[0, 1, 3, 5, 6, 8, 9, 9, 9]).set_index("a")
In [4]: pdf.groupby(pdf.index).apply(lambda x: x.b)
Here the group key (the index, named 'a') and the index of each returned Series share the same name, so the combined result would get a MultiIndex with two levels both named 'a'.
Another, more realistic example: groupwise drop_duplicates:
In [18]: df = pd.DataFrame({"B": [0, 0, 0, 1, 1, 1, 2, 2, 2]}, index=pd.Index([0, 1, 1, 2, 2, 2, 0, 0, 1], name='a'))
In [19]: df
Out[19]:
   B
a
0  0
1  0
1  0
2  1
2  1
2  1
0  2
0  2
1  2
In [20]: df.groupby('a').apply(pd.DataFrame.drop_duplicates)
Out[20]:
     B
a a
0 0  0
  0  2
1 1  0
  1  2
2 2  1
Is it possible to throw a warning on this for now, in case duplicate names are more common than we thought?
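To make the question concrete, here is a minimal sketch of what such a check could look like. It is not pandas code: `find_duplicate_names` and `warn_on_duplicate_names` are hypothetical helpers, the warning category and message are assumptions, and where pandas would actually call something like this (for example when the MultiIndex names are set) is left open. The helpers take a plain sequence of names, such as the contents of `result.index.names`.

```python
import warnings
from collections import Counter


def find_duplicate_names(names):
    """Return the level names that occur more than once, in first-seen order.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # None means "unnamed level"; several unnamed levels are not treated as a conflict.
    counts = Counter(name for name in names if name is not None)
    return [name for name, count in counts.items() if count > 1]


def warn_on_duplicate_names(names):
    """Emit a FutureWarning if any level name is repeated; return the duplicates.

    The category and wording are assumptions, not what pandas emits.
    """
    duplicates = find_duplicate_names(names)
    if duplicates:
        warnings.warn(
            f"Duplicate level names {duplicates}; this may be disallowed in the future.",
            FutureWarning,
            stacklevel=2,
        )
    return duplicates
```

With this sketch, `warn_on_duplicate_names(['a', 'a'])` (the names produced by the groupby examples above) warns and returns `['a']`, while `warn_on_duplicate_names(['a', 'b'])` stays silent and returns an empty list.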
About this issue
- State: closed
- Created 6 years ago
- Comments: 18 (18 by maintainers)
Commits related to this issue
- add fix for bug #19029 As of version 0.23.0 MultiIndex throws an exception in case it contains duplicated level names. This can happen as a result of various groupby operations (#21075). This commit ... — committed to guenteru/pandas by guenteru 6 years ago
- add groupby testcase (#19029) — committed to guenteru/pandas by guenteru 6 years ago
Hmm, this was supposed to be done for 0.23, but we missed it.
I still think it’s worthwhile doing for 0.23.1 (cc @guenteru if you have time to make a PR).