pandas: BUG: Replacing `pd.NA` by `None` has no effect

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"value": [42, None]}).astype({"value": "Int64"})

assert df.replace({pd.NA: None}).to_dict() == {"value": {0: 42, 1: None}}

Issue Description

Since version 1.4.0, .replace({pd.NA: None}) has no effect: pd.NA is no longer replaced by None.

Expected Behavior

The assertion in the example above passes in version 1.3.5.

Installed Versions

INSTALLED VERSIONS

commit : bb1f651536508cdfef8550f93ace7849b00046ee
python : 3.10.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.13-arch1-1
Version : #1 SMP PREEMPT Wed, 05 Jan 2022 16:20:59 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : de_DE.UTF-8

pandas : 1.4.0
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.1.0
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.29
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

About this issue

State: closed
Created 2 years ago
Reactions: 6
Comments: 15 (13 by maintainers)

Commits related to this issue

code sample for #45601 — committed to simonjayhawkins/pandas by simonjayhawkins 2 years ago

Most upvoted comments

@roib20 thanks for the investigation!

I also wanted to highlight the workaround that you already suggested, which gives None in both old and new pandas versions: explicitly casting to object dtype.

In [23]: df.astype(object).replace({pd.NA: None}).to_dict()
Out[23]: {'value': {0: 42, 1: None}}

This way you ensure that you have a data type that can hold any value exactly as you want it (in this case None).
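As a self-contained sketch of the workaround above (the variable names are only illustrative), the cast to object happens before the replacement:

```python
import pandas as pd

# Nullable integer column with one missing value, as in the report.
df = pd.DataFrame({"value": [42, None]}).astype({"value": "Int64"})

# Cast to object first so the column can hold a plain Python None,
# then replace pd.NA by None.
as_object = df.astype(object)
result = as_object.replace({pd.NA: None}).to_dict()
print(result)
```

The cast gives up the Int64 dtype, so this is best done as a last step, for example right before exporting the data.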

So the underlying question, as you mentioned above, is indeed: should operations like replace preserve the dtype (or at least try to preserve it where possible)? I personally think that setitem and fillna operations should preserve the dtype (https://github.com/pandas-dev/pandas/issues/39584, https://github.com/pandas-dev/pandas/issues/25288). But replace is different (e.g. you could use it to change a set of string values to numeric values with a given replacement dictionary), so we might want to be more flexible there. Still, we should probably try to preserve the dtype. And then the issue here is: is None a valid value for Int64 dtype or not? (In constructors it is generally considered a valid value … but for replace we should maybe be stricter.)
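To illustrate the point about constructors, here is a minimal sketch showing that a None passed in together with the Int64 dtype is accepted and stored as pd.NA (the names arr and ser are only illustrative):

```python
import pandas as pd

# None passed to a constructor with the nullable Int64 dtype is stored as pd.NA.
arr = pd.array([42, None], dtype="Int64")
ser = pd.Series([42, None], dtype="Int64")

print(arr[1] is pd.NA)
print(ser.isna().tolist())
```

So an Int64 column never holds a literal None; the open question in this issue is whether replace should coerce None in the same way or treat it as a request to leave the dtype.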

I did however write a small fix for this bug that makes the behavior more consistent with 1.3.5. I wrote this commit 0dd5a1d that appears to solve this issue and #45729, however I am waiting for more input on the desired behavior before making a pull request with more tests.

Small note on this: I don’t think a potential solution can use to_native_types here, because that will convert to object dtype, which should be avoided if not needed.

Aside: we should maybe add a dtype keyword to replace() to be able to specify the target dtype of the values after replacement, in case pandas inference doesn’t do it correctly (as long as we try to preserve the dtype but fall back to another one, we will always do some dtype inference, and dtype inference can always do something other than what you wanted).
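The dtype keyword is only a proposal in this comment, so the following is a rough sketch of how the same intent can be approximated today. The helper replace_then_cast is hypothetical and not part of pandas; it simply chains replace() with an explicit astype():

```python
import pandas as pd


def replace_then_cast(obj, mapping, dtype):
    """Hypothetical helper, not a pandas API: replace values, then force a dtype."""
    return obj.replace(mapping).astype(dtype)


# Map string labels to numbers and state the target dtype explicitly
# instead of relying on whatever dtype replace() infers.
labels = pd.Series(["low", "high", "low"], dtype=object)
codes = replace_then_cast(labels, {"low": 0, "high": 1}, "Int64")
print(codes.dtype)
```

Stating the dtype after the replacement keeps the result independent of how a given pandas version infers the dtype of the replaced values.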

jorisvandenbossche on Feb 3, 2022