pandas: BUG: None is not equal to None
If you have None in a series, it will not be considered equal to None (even in the same series).
import pandas as pd
A = pd.Series([None]*3)
A == A
gives
0 False
1 False
2 False
dtype: bool
but, of course
A.values == A.values
gives
array([ True, True, True])
I can’t find any reference to this behaviour in the documentation, and it looks quite unnatural.
Output of pd.show_versions()
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
About this issue
- State: open
- Created 6 years ago
- Reactions: 15
- Comments: 17 (10 by maintainers)
I think that people find it useful not to convert immediately to NaN and to keep their Python types because they want the objects they put in the series to behave like Python objects. This benefit is lost if you change the way equality works for some of the Python types (like `None`)!
Consider this piece of code:
I expect that `In [4]` should tell me whether the two series are equal member by member, vectorizing the equality in some sense. Currently they turn out not to be! But if I test equality member by member as in `In [5]`, I get (as expected) that the series are equal member by member.
Now, if I turn the series into `numpy.array`s as in `In [5]`, I get the expected result. Here NumPy is consistent with the vectorization in `In [6]`.
As you can clearly see, there is an inconsistency that is very hard to understand (you have to know, and there is no place in the docs where this is clearly stated) that:
`None` to NaN,
Having an inconsistent behaviour, leaving `None` as such but comparing them as different, violates the principle of least astonishment.
I spent quite some time (with two dataframes of tens of thousands of elements of different dtypes) working out why there was a difference when, looking at the corresponding rows, all the values were equal!
I don’t think anyone can really think that an inconsistent behaviour is a good thing…
However, this should be consistent with this:
In [7]: s = pd.Series([np.nan]*3)
so [6] from @jschendel should match [9] here. The incorrect case is actually `.eq`.
There is no way to address this without a major point release, since any change here will break backwards compatibility.
Nevertheless, this issue does imply the clear need for documentation improvement:
`None`s selectively behave as `nan`s.
I do not agree with @jreback and the current status of equality for `None` in pandas.
I’ve seen the warning box and I completely agree that, according to the IEEE floating-point standard, “a NaN is never equal to any other number (including another NaN)” (cited from What Every Computer Scientist Should Know About Floating-Point Arithmetic).
I have no problem, if the data type of a series is numeric, accepting that NaNs are not equal and even that `None` gets converted to NaN.
But if a series contains objects (strings, lists, or whatever), equality should be consistent with Python equality for such objects.
It’s a basic application of the principle of least astonishment. I don’t see any benefit in mangling `None` if the data type is non-numeric.
see the warnings box: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
This was done quite a while ago to make the behavior of nulls consistent, in that they don’t compare equal. This puts `None` and `np.nan` on an equal footing (not consistent with Python, BUT consistent with NumPy).
So this is not a bug, rather a consequence of straddling 2 conventions.
I suppose the documentation could be slightly enhanced.
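For readers who need missing values in the same position to count as equal, one possible workaround is sketched below. The helper name `eq_with_nulls` is hypothetical and not part of pandas; the sketch assumes both series share the same index and relies only on `Series.isna`, `==`, and `Series.equals`.

```python
import numpy as np
import pandas as pd


def eq_with_nulls(left: pd.Series, right: pd.Series) -> pd.Series:
    """Element-wise equality that also counts two missing values as equal.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    both_missing = left.isna() & right.isna()
    return (left == right) | both_missing


a = pd.Series([None, "x", np.nan], dtype=object)
b = pd.Series([None, "x", np.nan], dtype=object)

plain = a == b               # missing positions compare as False
aware = eq_with_nulls(a, b)  # missing positions compare as True
same = a.equals(b)           # whole-series check; matching missing values count as equal

print(plain.tolist())
print(aware.tolist())
print(same)
```

The mask-based version keeps the element-wise result, which helps when locating the rows that actually differ; `Series.equals` only answers whether the two series match as a whole.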
Also a bit strange that `==` is inconsistent with the `eq` method in this case:
http://pandas-docs.github.io/pandas-docs-travis/missing_data.html#values-considered-missing
@mapio how familiar are you with the NumPy / pandas type system? When you do `Series([1, None, 2])`, we infer it as floats. If you want Python objects, specify that.
But as you’ve seen, many places in pandas, especially in ops like `==`, treat `None` as missing.
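A minimal sketch of the inference described above, assuming a default pandas installation: without an explicit dtype the `None` becomes NaN in a float series, while `dtype=object` keeps it as a Python object that `==` still treats as missing. The variable names are illustrative only.

```python
import pandas as pd

inferred = pd.Series([1, None, 2])                # dtype inferred; None becomes NaN
explicit = pd.Series([1, None, 2], dtype=object)  # None is kept as a Python object

print(inferred.dtype)                   # float64
print(explicit.dtype)                   # object
print(explicit.iloc[1] is None)         # True
print((explicit == explicit).tolist())  # [True, False, True]
```

So requesting object dtype preserves the value itself, but not Python's equality semantics for it.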