pandas: BUG: Replacing `pd.NA` by `None` has no effect
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({"value": [42, None]}).astype({"value": "Int64"})
assert df.replace({pd.NA: None}).to_dict() == {"value": {0: 42, 1: None}}
Issue Description
Since version 1.4.0, `.replace({pd.NA: None})` has no effect: `pd.NA` is no longer replaced by `None`.
Expected Behavior
The assertion in the example shown above is fulfilled in version 1.3.5.
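To see what actually changes between versions, it can help to compare the column before and after the call instead of relying on a single assertion. The snippet below is a minimal diagnostic sketch built on the reproducible example; the variable names `replaced`, `before` and `after` are illustrative only, and the printed values depend on the installed pandas version.

```python
import pandas as pd

df = pd.DataFrame({"value": [42, None]}).astype({"value": "Int64"})
replaced = df.replace({pd.NA: None})

# Record the dtype and whether the missing entry is still pd.NA,
# before and after the replace call.
before = (str(df["value"].dtype), df.loc[1, "value"] is pd.NA)
after = (str(replaced["value"].dtype), replaced.loc[1, "value"] is pd.NA)

print("pandas", pd.__version__)
print("before:", before)
print("after: ", after)
```

If `after` equals `before`, the replacement had no effect on that version.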
Installed Versions
INSTALLED VERSIONS
commit : bb1f651536508cdfef8550f93ace7849b00046ee
python : 3.10.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.13-arch1-1
Version : #1 SMP PREEMPT Wed, 05 Jan 2022 16:20:59 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : de_DE.UTF-8

pandas : 1.4.0
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.1.0
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.29
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 6
- Comments: 15 (13 by maintainers)
Commits related to this issue
- code sample for #45601 — committed to simonjayhawkins/pandas by simonjayhawkins 2 years ago
@roib20 thanks for the investigation!
I also wanted to highlight the solution / workaround that you already suggested to actually get `None` in both old and new pandas versions, which is explicitly casting to `object` dtype. This way you ensure that you have a data type that can hold any value exactly as you want it (in this case `None`).
But the underlying question, as you mentioned above, is indeed: should operations like `replace` preserve the dtype (or try to preserve it if possible)? I personally think that for setitem or fillna operations we should preserve the dtype (https://github.com/pandas-dev/pandas/issues/39584, https://github.com/pandas-dev/pandas/issues/25288). But `replace` is something different (e.g. you could use it to change a set of string values to numeric values with a given replacement dictionary), and we might want to be more flexible. But we should probably still try to preserve the dtype. And then the issue here is: is `None` a valid value for Int64 dtype or not? (In constructors, it is generally considered a valid value … but for replace we should maybe be stricter.)
Small note on this: I don’t think a potential solution can use `to_native_types` here, because that will convert to object dtype, which should be avoided if not needed.
Aside: we should maybe add a `dtype` keyword to `replace()` to be able to specify the target dtype of the values after replacement, in case pandas inference doesn’t do it correctly (as long as we try to preserve the dtype but fall back to another one, we will always do some dtype inference, and dtype inference can always do something other than what you wanted).
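The two ideas in this comment can be sketched in a few lines. The first part shows the `object`-dtype workaround as described (the exact snippet from the original comment is not preserved here, so this is a reconstruction under that assumption). The second part uses `replace_with_dtype`, a hypothetical helper and not a pandas API, to approximate what a `dtype` keyword on `replace()` might do by casting explicitly after the replacement.

```python
import pandas as pd

# Part 1: cast to object first, so the column can hold None as-is.
df = pd.DataFrame({"value": [42, None]}).astype({"value": "Int64"})
as_none = df.astype(object).replace({pd.NA: None})


# Part 2: hypothetical helper (not part of pandas) that emulates a
# `dtype` keyword by replacing on object dtype, then casting explicitly.
def replace_with_dtype(series, to_replace, dtype):
    replaced = series.astype(object).replace(to_replace)
    return replaced.astype(dtype)


labels = pd.Series(["low", "high", "low"])
codes = replace_with_dtype(labels, {"low": 0, "high": 1}, "Int64")

print(as_none.to_dict())
print(codes.dtype)
```

Casting explicitly at the end sidesteps dtype inference entirely, at the cost of an intermediate object-dtype copy. Some pandas versions may emit a deprecation warning about downcasting during the `replace` call in the helper.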