pandas: BUG: "cannot reindex from a duplicate axis" raised with a unique index, duplicated column names, and specific numpy array values

Code Sample

import pandas 
import numpy as np

a = np.array([[1,2],[3,4]]) 

# Does NOT work:
b = np.array([[0.5,6],[7,8]])
# b = np.array([[.5,6],[7,8]])  # Same problem

# This one works fine:
# b = np.array([[5,6],[7,8]])

dfA = pandas.DataFrame(a)
# This works fine EVEN with 0.5, because the column names are different:
# dfA = pandas.DataFrame(a, columns=['a','b'])
dfB = pandas.DataFrame(b)

df_new = pandas.concat([dfA, dfB], axis = 1)

print(df_new[df_new > 5])

Problem description

There is a bug that appears when specific numpy values are combined with duplicated DataFrame column names and a selection operation such as df[df > 5] is used. An exception is thrown saying “cannot reindex from a duplicate axis”. It should not be thrown, because:

  • The DataFrame has no duplicated index labels (df.index.is_unique is True)
  • The DataFrame has duplicated column names, but that should not be a problem for a selection operation such as df_new[df_new > 5]
  • Whether the numpy values are float or int should not change the behavior of the code

However, the values in the numpy array DO change the behavior of the DataFrame selection if the DataFrame has duplicated column names.
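
One way to see what the float value changes is to compare the column dtypes in the two cases. This is only a diagnostic sketch, not a confirmed root cause: it assumes the difference comes from the concatenated frame holding mixed int and float columns when b contains 0.5. The helper name build_frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def build_frame(b_values):
    """Concatenate the int frame from the report with a frame built from b_values."""
    a = np.array([[1, 2], [3, 4]])
    b = np.array(b_values)
    return pd.concat([pd.DataFrame(a), pd.DataFrame(b)], axis=1)


failing = build_frame([[0.5, 6], [7, 8]])  # a single float makes all of b float
working = build_frame([[5, 6], [7, 8]])    # b stays integer, like a

# Both frames have the same duplicated column labels and a unique index.
print(list(failing.columns), failing.index.is_unique)
print(list(working.columns), working.index.is_unique)

# Only the dtypes differ: "i" (int) and "f" (float) kinds in the failing case,
# a single integer kind in the working case.
failing_dtypes = sorted({dtype.kind for dtype in failing.dtypes})
working_dtypes = sorted({dtype.kind for dtype in working.dtypes})
print(failing_dtypes)
print(working_dtypes)
```

If the mixed dtypes are indeed what matters, casting both inputs to the same dtype before concatenating would be another thing to try, though that is not verified in this report.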

Expected Output

    0   1    0  1
0 NaN NaN  NaN  6
1 NaN NaN  7.0  8

Current Output

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis
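
On affected versions, one possible workaround is to give the columns temporary unique labels, apply the mask, and then restore the original names. This is a sketch that relies on the observation in the code sample that the selection works when the column names differ; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])
b = np.array([[0.5, 6], [7, 8]])
df_new = pd.concat([pd.DataFrame(a), pd.DataFrame(b)], axis=1)

# Remember the duplicated labels so they can be restored afterwards.
original_columns = df_new.columns

# Work on a copy whose columns are temporarily relabelled 0..n-1 (all unique).
tmp = df_new.copy()
tmp.columns = range(tmp.shape[1])

# The boolean mask is built from the relabelled frame, so both sides align.
masked = tmp[tmp > 5]

# Put the original, duplicated column names back.
masked.columns = original_columns
print(masked)
```

The copy keeps df_new itself unchanged, at the cost of duplicating the data once.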

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-28-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 32 (17 by maintainers)

Most upvoted comments

Yup 😄 @GabrielSimonetto if you wanted to submit a test to make sure this doesn’t break again in the future, that would be welcome!
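
A regression test along the lines suggested here might look like the sketch below. It is not the test that was actually submitted: the test name is hypothetical, and it assumes a pandas version in which the selection no longer raises.

```python
import numpy as np
import pandas as pd


def test_boolean_mask_duplicate_columns_mixed_dtypes():
    # Hypothetical test name; reproduces the frames from the original report.
    df_a = pd.DataFrame(np.array([[1, 2], [3, 4]]))        # integer columns 0, 1
    df_b = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))      # float columns 0, 1
    df = pd.concat([df_a, df_b], axis=1)                   # columns 0, 1, 0, 1

    # On affected versions this line raised
    # "ValueError: cannot reindex from a duplicate axis".
    result = df[df > 5]

    # Values <= 5 become NaN, which turns the integer columns into floats.
    expected = pd.DataFrame(
        [[np.nan, np.nan, np.nan, 6.0], [np.nan, np.nan, 7.0, 8.0]],
        columns=[0, 1, 0, 1],
    )
    pd.testing.assert_frame_equal(result, expected)
```

Building the expected frame by hand, rather than through another masking call, keeps the test from passing merely because both sides share the same code path.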

@MarcoGorelli I found that building instead with the command CFLAGS='-Wno-error=deprecated-declarations' python setup.py build_ext -i generally fixes things, although I’m not sure if it’ll work in this case. There’s a thread about these problems here: https://github.com/pandas-dev/pandas/issues/33315

@igorluppi great, thanks! Could you edit this example into the original post?