pandas: BUG: DataFrame outer merge changes key columns from int64 to float64
```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3, 4], 'val1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': [1, 2, 3, 5], 'val2': [1, 2, 3, 4]})
df = df1.merge(df2, how='outer')
```
I was expecting `key` to stay int64, since a merge can't introduce missing key values if none were present in the inputs.
```python
>>> df.dtypes
key     float64
val1    float64
val2    float64
dtype: object
```
Version 0.15.0-6-g403f38d
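For what it's worth, on a modern pandas (this behavior was fixed by the commit referenced in the commits section below, released in 0.19) the `key` column keeps int64 after this merge, because the union of the keys has no missing values; only `val1`/`val2` are upcast to float64 to hold the introduced NaNs. A minimal check, assuming pandas >= 0.19:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3, 4], 'val1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': [1, 2, 3, 5], 'val2': [1, 2, 3, 4]})
df = df1.merge(df2, how='outer')

# the key union {1, 2, 3, 4, 5} is complete, so 'key' needs no NaN slot
assert str(df['key'].dtype) == 'int64'
# val1 has no row for key 5 and val2 none for key 4, so both gain NaN
assert str(df['val1'].dtype) == 'float64'
assert str(df['val2'].dtype) == 'float64'
```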
About this issue
- State: closed
- Created 10 years ago
- Reactions: 19
- Comments: 18 (11 by maintainers)
Commits related to this issue
- BUG: preserve merge keys dtypes when possible closes #8596 xref to #13169 as assignment of Index of bools not retaining dtype — committed to jreback/pandas by jreback 8 years ago
I definitely lose data in the int->float conversion during the merge: when I cast back from float to int, my IDs stop working.
Any idea how this can be avoided, if at all?
@theholy7 old-school numpy integers don't have NaN values, so the only option is to use a different dtype. Depending on the performance drop you can bear, your options are:

- `dtype='object'` – probably the worst possible case: every value is stored as a garbage-collected PyObject (consuming 28 bytes instead of 8) and no vectorization is possible
- `dtype='category'` – values are stored as integer indexes into a separate numpy array, much cheaper memory-wise, but arithmetic operations might be tricky
- `dtype='Int64'` – the new kid on the block, should be the most efficient on both memory and arithmetic, but still marked as an experimental API
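To make the `Int64` option concrete, here is a sketch of casting the value columns to the nullable integer extension dtype before merging, so introduced missing entries become `pd.NA` instead of forcing a float upcast (assuming pandas >= 0.24, where nullable `Int64` was introduced):

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3, 4], 'val1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': [1, 2, 3, 5], 'val2': [1, 2, 3, 4]})

# cast value columns to the nullable Int64 extension dtype up front;
# note the capital I -- lowercase 'int64' is the plain numpy dtype
df1 = df1.astype({'val1': 'Int64'})
df2 = df2.astype({'val2': 'Int64'})

out = df1.merge(df2, how='outer')
# missing entries are pd.NA and the columns remain integer-typed
assert str(out['val1'].dtype) == 'Int64'
assert str(out['val2'].dtype) == 'Int64'
```

The round-trip problem from the comment above goes away: there is no float conversion to cast back from, so the integer IDs survive the merge exactly.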