pandas: Equality between DataFrames misbehaves if columns contain NaN

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2],
   ...:                  columns=[3, np.nan, 1])

In [3]: s + s # good
Out[3]: 
       3.0  NaN    1.0
 1.0    -2    -2    -2
NaN     -2    -2    -2
 2.0    -2    -2    -2

In [4]: s == s # bad
Out[4]: 
       3.0 NaN    1.0
 1.0  True  NaN  True
NaN   True  NaN  True
 2.0  True  NaN  True

Problem description

While it is true that np.nan != np.nan, pandas disregards this when matching index labels (indeed, s.loc[:, np.nan] works and selects the NaN-labeled column), so element-wise comparison should be coherent with that behaviour.
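The claim that pandas already treats NaN labels as matchable can be checked directly. A minimal sketch (reusing the `s` from the code sample above):

```python
import numpy as np
import pandas as pd

s = pd.DataFrame(-1, index=[1, np.nan, 2], columns=[3, np.nan, 1])

# Label-based selection treats the NaN label like any other label:
print(s.loc[:, np.nan])

# And the Index machinery considers NaN labels equal to each other:
idx = pd.Index([3, np.nan, 1])
print(idx.get_loc(np.nan))                    # NaN is locatable
print(idx.equals(pd.Index([3, np.nan, 1])))   # True: NaN labels compare equal
```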

Expected Output

In [4]: s == s
Out[4]: 
       3.0  NaN   1.0
 1.0  True  True  True
NaN   True  True  True
 2.0  True  True  True

Output of pd.show_versions()

INSTALLED VERSIONS

commit: b45325e283b16ec8869aaea407de8256fc234f33
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+201.gb45325e28.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

By the way, I’m not at all against having equality behave as in your category (1) (dropping the exception I just described) and inserting NaNs… I didn’t propose it only because the change would be bigger, and casting bools to objects is sad.

I am not sure this was the reason. If comparison operations did align, you would 1) align, introducing NaNs in the values, and 2) compare, and where there are NaNs you would just get False (exactly what you now get with already-aligned objects that contain NaNs). So even if comparisons did alignment, you could still get a normally functioning boolean result.

I think the reasons not to let comparisons align were 1) to make Series behaviour consistent with DataFrame (though of course, we could also have changed the DataFrame behaviour to align as well) and 2) people liked the error as a sanity check (often, when doing a comparison you want to use it for boolean indexing, and if alignment happened there, it might give unexpected results). One example use case Wes gave: s1[1:] == s2[:1].
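That sanity check is observable behaviour: comparing Series whose indexes are not identical raises instead of silently aligning. A minimal reproduction of Wes's example (the data values are made up for illustration):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([1, 2, 3])

# The slices have different indexes ([1, 2] vs [0]), so the comparison
# raises rather than aligning behind the user's back.
try:
    s1[1:] == s2[:1]
except ValueError as err:
    print("ValueError:", err)
```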

Interesting, but my understanding is that it does not address the specific issue of having the same labels in a different order. I understand that the reason not to support comparison between different indexes is to avoid NaNs (or dropping elements/rows). What I suggest instead is simply to check whether the labels are equal after sorting.
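The sorted-labels check suggested above could look something like this hypothetical helper (labels_match_unordered is not a pandas API, just an illustration; Index.equals already treats NaN labels as equal):

```python
import numpy as np
import pandas as pd

def labels_match_unordered(left, right):
    # Hypothetical check: do the two indexes hold the same labels,
    # ignoring order? NaN labels sort last and compare equal via equals().
    return left.sort_values().equals(right.sort_values())

print(labels_match_unordered(pd.Index([3, np.nan, 1]),
                             pd.Index([1, 3, np.nan])))  # same labels, reordered
print(labels_match_unordered(pd.Index([1, 2]),
                             pd.Index([1, 3])))          # different labels
```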

I don’t think this is a good idea. Most pandas operations already either (1) align arguments or (2) require identical labels. This would add a third type: (3) require same labels, in any order.