pandas: BUG: DataFrame outer merge changes key columns from int64 to float64
```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3, 4], 'val1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': [1, 2, 3, 5], 'val2': [1, 2, 3, 4]})
df = df1.merge(df2, how='outer')
```
I was expecting `key` to stay int64, since a merge can't introduce missing key values if none were present in the inputs.
```python
>>> df.dtypes
key     float64
val1    float64
val2    float64
dtype: object
```
Version 0.15.0-6-g403f38d
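For what it's worth, on a modern pandas (this behavior was fixed by the commit referenced in the commits section below, released in 0.19) the `key` column keeps int64 after this merge, because the union of the keys has no missing values; only `val1`/`val2` are upcast to float64 to hold the introduced NaNs. A minimal check, assuming pandas >= 0.19:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3, 4], 'val1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': [1, 2, 3, 5], 'val2': [1, 2, 3, 4]})
df = df1.merge(df2, how='outer')

# the key union {1, 2, 3, 4, 5} is complete, so 'key' needs no NaN slot
assert str(df['key'].dtype) == 'int64'
# val1 has no row for key 5 and val2 none for key 4, so both gain NaN
assert str(df['val1'].dtype) == 'float64'
assert str(df['val2'].dtype) == 'float64'
```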
About this issue
- State: closed
- Created 10 years ago
- Reactions: 19
- Comments: 18 (11 by maintainers)
Commits related to this issue
- BUG: preserve merge keys dtypes when possible closes #8596 xref to #13169 as assignment of Index of bools not retaining dtype — committed to jreback/pandas by jreback 8 years ago
I definitely lose data in the int->float conversion during the merge: when I cast back from float to int, my IDs stop working.
Any idea how this can be avoided, if at all?
@theholy7 old-school numpy integers don't have NaN values, so the only option is to use a different dtype. Depending on the performance drop you can bear, your options are:

- `dtype='object'` – probably the worst possible case: every value is stored as a garbage-collected PyObject (consuming 28 bytes instead of 8) and no vectorization is possible
- `dtype='category'` – values are stored as integer indexes into a separate numpy array, much cheaper memory-wise, but arithmetic operations might be tricky
- `dtype='Int64'` – the new kid on the block, should be the most efficient on both memory and arithmetic, but still marked as an experimental API
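To make the `Int64` option concrete, here is a sketch of casting the value columns to the nullable integer extension dtype before merging, so introduced missing entries become `pd.NA` instead of forcing a float upcast (assuming pandas >= 0.24, where nullable `Int64` was introduced):

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3, 4], 'val1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': [1, 2, 3, 5], 'val2': [1, 2, 3, 4]})

# cast value columns to the nullable Int64 extension dtype up front;
# note the capital I -- lowercase 'int64' is the plain numpy dtype
df1 = df1.astype({'val1': 'Int64'})
df2 = df2.astype({'val2': 'Int64'})

out = df1.merge(df2, how='outer')
# missing entries are pd.NA and the columns remain integer-typed
assert str(out['val1'].dtype) == 'Int64'
assert str(out['val2'].dtype) == 'Int64'
```

The round-trip problem from the comment above goes away: there is no float conversion to cast back from, so the integer IDs survive the merge exactly.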