pandas: BUG: DataFrame outer merge changes key columns from int64 to float64

import pandas as pd

df1 = pd.DataFrame({'key': [1,2,3,4], 'val1': [1,2,3,4]})
df2 = pd.DataFrame({'key': [1,2,3,5], 'val2': [1,2,3,4]})

df = df1.merge(df2, how='outer')

Was expecting key to stay int64, since a merge can’t introduce missing key values if none were present in the inputs.

print(df.dtypes)

key     float64
val1    float64
val2    float64
dtype: object

Version 0.15.0-6-g403f38d

About this issue

  • State: closed
  • Created 10 years ago
  • Reactions: 19
  • Comments: 18 (11 by maintainers)

Most upvoted comments

I definitely lose data on the int->float conversion during the merge. When I cast back from float->int, my IDs stop working.

Any idea how this can be avoided if at all?
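
One likely cause of the data loss described above: float64 has a 53-bit significand, so not every integer above 2**53 can be represented exactly. The sketch below uses plain Python and a made-up ID value (an assumption for illustration, not taken from the report) to show the round trip changing the number.

```python
# Hypothetical ID just above 2**53, the largest range in which float64
# can represent every integer exactly.
original_id = 2**53 + 1

# Round trip int -> float -> int, mimicking a key column that is upcast
# to float64 by the merge and then cast back to an integer.
round_tripped = int(float(original_id))

print(original_id)                   # 9007199254740993
print(round_tripped)                 # 9007199254740992
print(original_id == round_tripped)  # False
```

IDs below 2**53 survive the round trip unchanged, so this only bites with large identifiers.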

@theholy7 old-school numpy integers don’t have NaN values, so the only option is to use a different dtype. Depending on the performance drop you can bear, your options are:

  • dtype='object' – probably the worst possible case: all values are stored as garbage-collected PyObjects (consuming 28 bytes instead of 8) and no vectorization is possible
  • dtype='category' – all values are stored as integer indexes into a different numpy array, much cheaper memory-wise, but arithmetic operations might be tricky
  • dtype='Int64' – new kid on the block, should be the most efficient on both memory and arithmetic, but still marked as experimental API
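
As a minimal sketch of the last option, the frames from the original report can be cast to the nullable 'Int64' dtype before merging. This assumes a reasonably recent pandas release in which the nullable integer dtype is available and supported as a merge key. The name `merged` is only illustrative.

```python
import pandas as pd

# Same data as in the report, cast to the nullable integer dtype up front.
df1 = pd.DataFrame({'key': [1, 2, 3, 4], 'val1': [1, 2, 3, 4]}).astype('Int64')
df2 = pd.DataFrame({'key': [1, 2, 3, 5], 'val2': [1, 2, 3, 4]}).astype('Int64')

# Outer merge: rows without a match get <NA> rather than NaN, so the
# columns do not need to be upcast to float64.
merged = df1.merge(df2, on='key', how='outer')

print(merged)
print(merged.dtypes)
```

Missing entries show up as `<NA>`, and the columns should keep an integer dtype, so no float round trip is needed afterwards.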