pandas: BUG: None is not equal to None

If you have None in a series, it will not be considered equal to None (even in the same series).

import pandas as pd

A = pd.Series([None] * 3)
A == A

gives

0    False
1    False
2    False
dtype: bool

but, of course

A.values == A.values

gives

array([ True,  True,  True])

I can’t find any reference to this behaviour in the documentation and looks quite unnatural.
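For what it's worth, pandas does provide a null-aware whole-series comparison via Series.equals, which treats missing values in matching positions as equal; a minimal sketch:

```python
import pandas as pd

A = pd.Series([None] * 3)

# Element-wise == treats None as missing, so it never compares equal.
print((A == A).tolist())   # [False, False, False]

# Series.equals considers missing values in the same positions equal.
print(A.equals(A))         # True
```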

Output of pd.show_versions()


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 15
  • Comments: 17 (10 by maintainers)

Most upvoted comments

I think people find it useful not to immediately convert to NaN and to keep their Python types, because they want the objects they put in the series to behave like Python objects. This benefit is lost if you change the way equality works for some of those Python types (like None)!

Consider this piece of code:

In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])

In [4]: all(A == B)
Out[4]: False
In [5]: all(a == b for a, b in zip(A, B))
Out[5]: True

In [6]: all(A.values == B.values)
Out[6]: True
In [7]: all(a == b for a, b in zip(A.values, B.values))
Out[7]: True

I expect In [4] to tell me whether the two series are equal member by member, vectorizing the equality in some sense. Currently it tells me they are not! But if I test equality member by member as in In [5], I get (as expected) that the series are equal.

If I instead convert the series to numpy arrays as in In [6], I get the expected result. Here NumPy is consistent with the member-by-member comparison in In [7].

As you can see, there is an inconsistency that is very hard to understand: you have to know (and there is no place in the docs where this is clearly stated) that:

  • pandas converts all missing values except None to NaN,
  • pandas (following IEEE 754) treats NaNs as different from themselves.

Keeping None untouched but comparing None values as different is inconsistent and violates the principle of least astonishment.

I spent quite some time (with two dataframes of tens of thousands of elements of different dtypes) figuring out why there was a difference when, looking at the corresponding rows, all the values were equal!

I don’t think anyone can really think that an inconsistent behaviour is a good thing…
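For reference, a null-aware element-wise comparison can be spelled out explicitly by also matching missing positions; a minimal sketch:

```python
import pandas as pd

A = pd.Series(['a', None, 2])
B = pd.Series(['a', None, 2])

# Treat two positions as equal when the values compare equal
# OR both positions are missing.
equal = (A == B) | (A.isna() & B.isna())
print(equal.tolist())   # [True, True, True]
```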

however this should be consistent with this:

In [6]: import numpy as np
In [7]: s = pd.Series([np.nan]*3)

In [8]: s == s
Out[8]: 
0    False
1    False
2    False
dtype: bool

In [9]: s.eq(s)
Out[9]: 
0    False
1    False
2    False
dtype: bool

so [6] from @jschendel should match [9] here. The incorrect case is actually .eq.

There is no way to address this without a major point release, since any change here will break backwards compatibility.

Nevertheless, this issue does imply a clear need for documentation improvement: Nones selectively behave as NaNs.

I do not agree with @jreback and the current status of equality for None in pandas.

I’ve seen the warning box and I completely agree that, according to IEEE floating-point standard, “a NaN is never equal to any other number (including another NaN)” (cited from What Every Computer Scientist Should Know About Floating-Point Arithmetic).

I’ve no problems, if a data type of a series is numeric, to accept that NaNs are not equal and even that None get converted to NaNs.
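To illustrate the numeric case: when a list containing None is inferred as a numeric dtype, the None is silently converted to NaN; a minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, None, 2])   # inferred as float64
print(s.dtype)                # float64
print(np.isnan(s[1]))         # True: the None became NaN
print((s == s).tolist())      # [True, False, True] -- NaN != NaN
```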

But if a series contains objects (be they strings, lists, or whatever), equality should be consistent with Python equality for those objects.

It’s a basic application of the principle of least astonishment. I don’t see any benefit in mangling None if the data type is non-numeric.

see the warnings box: http://pandas.pydata.org/pandas-docs/stable/missing_data.html

This was done quite a while ago to make the behavior of nulls consistent, in that they don’t compare equal. This puts None and np.nan on an equal footing (not consistent with Python, but consistent with NumPy).

So this is not a bug, rather a consequence of straddling 2 conventions.

I suppose the documentation could be slightly enhanced.

Also a bit strange that == is inconsistent with the eq method in this case:

In [2]: pd.__version__
Out[2]: '0.23.0.dev0+658.g17c1fad'

In [3]: s = pd.Series([None]*3)

In [4]: s
Out[4]:
0    None
1    None
2    None
dtype: object

In [5]: s == s
Out[5]:
0    False
1    False
2    False
dtype: bool

In [6]: s.eq(s)
Out[6]:
0    True
1    True
2    True
dtype: bool

http://pandas-docs.github.io/pandas-docs-travis/missing_data.html#values-considered-missing

@mapio how familiar are you with the NumPy / pandas type system? When you do Series([1, None, 2]), we infer it as floats. If you want Python objects, specify that.

In [5]: s = pd.Series([1, None, 2], dtype=object)

In [6]: s
Out[6]:
0       1
1    None
2       2
dtype: object

But as you’ve seen, many places in pandas, especially in ops like ==, treat None as missing.
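Since == treats None as missing, the documented way to detect missing values (whether None or NaN) is isna rather than an equality comparison; a minimal sketch:

```python
import pandas as pd

s = pd.Series([1, None, 2], dtype=object)

# == comparisons treat None as missing; isna() detects missing values
# reliably regardless of whether they are stored as None or NaN.
print(s.isna().tolist())   # [False, True, False]
```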