pandas: Default value for missing values on merge
This is a reopening of https://github.com/pandas-dev/pandas/issues/1836. The suggestion there was to add a parameter to pd.merge, such as fillvalue, whose value would be used instead of NaN for missing values. This isn’t simply solved by fillna since adding NaN to columns casts them to float.
https://github.com/pandas-dev/pandas/issues/1836 also asked to provide an example where this would be useful. Admittedly, in my case there might be a simpler solution than merge, but anyway.
I have a DataFrame with a single column which is basically an index: it contains distinct numbers. I also have a DataFrame where one column contains some (but not all) values from the same index, while others contain useful data. I want to extend this DataFrame to include all values from the index, filling the other columns with zeros. I do this by calling
pd.merge(df_with_index, smaller_df_with_data, on='col_index', how='outer').fillna(0)
and end up with a DataFrame where all columns except for col_index are cast to float.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.23-1-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: None
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 6
- Comments: 16 (7 by maintainers)
you can add
.fillna(0, downcast='infer')if you want.This is just adding more complexity to an already very complex
pd.merge. we already have.fillnafor this purpose.So I just stumbled into this thread looking for the solution the OP is asking.
I have a different case, in which I have a
DataFramewith NaNs, which should remain NaNs, which I need to merge with anotherDataFramewhich has only partial information for the firstDataFrame:In this example,
all.csvcontains missing information in some columns, some rows, and I don’t want to make up some fake value for them (and later on I will most likely want to filter out those rows, only in some cases, and depending on whether I am selecting that column with NaNs or not).activity.csvcontains activity data for only some of thedealIds inall.csv, and I do want thedealIds that are not inactivity.csvto be filled up with 0 (no activity). However, if I use the code above I will end up with aDataFramewhere the NaNs in the originalall.csvDataFrameare replaced by 0, which is not what I intend to do.One possible workaround would be something like this:
But this feels like something I should not be doing.
Same issue here. Like @GSanchis I want to left-join (merge) two DataFrames where, in my case, the second one has a single column of ints or Strings where the missing values need to filled with 0 or empty String, respectively. This is very common use case in many data science/data mining task. But that automatic conversion to float type for capturing nan with no means to fill the nan values without much effort always breaks stuff. With all due respect for the greatness of pandas I really don’t understand why such basic requirements for data manipulation won’t be fixed?
+1 Fillna is no solution. If yours merge produces NULLs for an int64 column, casting to floats is a disaster since the integer values are lost when casting to floats.
It does handle the NA type. NA ist the missing value indicator for extension dtypes, e.g. if you are merging
you keep
Int64dtype. But if you change the dtypes toint64np.nanis used, hence the conversion to float.Hello this feature is needed since on very large and complex dataset we focus on memory size and pandas changes column types to float32 (even if you have int8) 😦
Thanks. Can understand now.
But maybe the docu can improved here? I am not sure about it.
Despite explaining the technical details and backgrounds just add a FAQ like section to
merge()docu Why merge convert my integer column to float and how to solve this.For newbies it is quite hard to understand the difference between
Int64andint64.fillna()is definitely not a solution but a workaround.intcolumn and have anintcolumn.merge()converts thatintcolumn tofloat(because of missing values).0).int.Does anyone disagree that this is only a workaround but not a solution?
But I am not sure if we need an
fillnaargument inmerge()because we also could discussed about whymerge()does the conversion!pandas.NAis not alwaysfloatanymore!You can have
NAint integer columns.Question: Why does not
merge()handle theNAtype by itself?+1 we need this feature
Perhaps something similar to
fill_valueinreindex?I have a similar issue: Working only with string data, the missing values still becomes NaN, which makes no sense in a string column. My target is to have None values.
The proposed fix .fillna(value=None, downcast=‘defer’) does not work, because fillna thinks the value parameter is missing. “ValueError: Must specify a fill ‘value’ or ‘method’.”