pandas: Equality between DataFrames misbehaves if columns contain NaN

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2],
   ...:                  columns=[3, np.nan, 1])

In [3]: s + s # good
Out[3]: 
       3.0  NaN    1.0
 1.0    -2    -2    -2
NaN     -2    -2    -2
 2.0    -2    -2    -2

In [4]: s == s # bad
Out[4]: 
       3.0 NaN    1.0
 1.0  True  NaN  True
NaN   True  NaN  True
 2.0  True  NaN  True

Problem description

While it is true that np.nan != np.nan, pandas disregards this when matching index labels (indeed, s.loc[:, np.nan] works and selects the NaN-labeled column), so element-wise comparison should be coherent with that behaviour.
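The claim that pandas already treats NaN labels as matchable can be checked directly. A minimal sketch (reusing the `s` from the code sample above):

```python
import numpy as np
import pandas as pd

s = pd.DataFrame(-1, index=[1, np.nan, 2], columns=[3, np.nan, 1])

# Label-based selection treats the NaN label like any other label:
print(s.loc[:, np.nan])

# And the Index machinery considers NaN labels equal to each other:
idx = pd.Index([3, np.nan, 1])
print(idx.get_loc(np.nan))                    # NaN is locatable
print(idx.equals(pd.Index([3, np.nan, 1])))   # True: NaN labels compare equal
```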

Expected Output

In [4]: s == s
Out[4]: 
       3.0  NaN   1.0
 1.0  True  True  True
NaN   True  True  True
 2.0  True  True  True

Output of pd.show_versions()

INSTALLED VERSIONS

commit: b45325e283b16ec8869aaea407de8256fc234f33
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+201.gb45325e28.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

By the way, I’m not at all against having equality behave as in your category (1) (dropping the exception I just described) and inserting NaNs… I didn’t propose it only because the change would be bigger, and casting bools to objects is sad.

I am not sure this was the reason. If comparison operations did align, you would 1) align, introducing NaNs in the values, and 2) compare, and where there are NaNs you would just get False (exactly what you now get with already-aligned objects that contain NaNs). So even if comparisons did alignment, you could still get a normally functioning boolean result.

I think the reasons not to let comparisons align were 1) to make Series behaviour consistent with DataFrame (though of course, we could also have changed the DataFrame behaviour to align as well) and 2) people liked the error as a sanity check (often, when doing a comparison you want to use it for boolean indexing, and if alignment happened there, it might give unexpected results). One example use case Wes gave: s1[1:] == s2[:1].
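That sanity check is observable behaviour: comparing Series whose indexes are not identical raises instead of silently aligning. A minimal reproduction of Wes's example (the data values are made up for illustration):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([1, 2, 3])

# The slices have different indexes ([1, 2] vs [0]), so the comparison
# raises rather than aligning behind the user's back.
try:
    s1[1:] == s2[:1]
except ValueError as err:
    print("ValueError:", err)
```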

Interesting, but my understanding is that it does not address the specific issue of having the same labels in a different order. I understand that the reason not to support comparison between different indexes is to avoid NaNs (or dropping elements/rows). What I suggest instead is simply to check whether the labels are equal after sorting.
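The sorted-labels check suggested above could look something like this hypothetical helper (labels_match_unordered is not a pandas API, just an illustration; Index.equals already treats NaN labels as equal):

```python
import numpy as np
import pandas as pd

def labels_match_unordered(left, right):
    # Hypothetical check: do the two indexes hold the same labels,
    # ignoring order? NaN labels sort last and compare equal via equals().
    return left.sort_values().equals(right.sort_values())

print(labels_match_unordered(pd.Index([3, np.nan, 1]),
                             pd.Index([1, 3, np.nan])))  # same labels, reordered
print(labels_match_unordered(pd.Index([1, 2]),
                             pd.Index([1, 3])))          # different labels
```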

I don’t think this is a good idea. Most pandas operations already either (1) align arguments or (2) require identical labels. This would add a third type: (3) require same labels, in any order.