pandas: BUG: merging on int32 platforms with large blocks
Hello everyone,
I am trying to merge a ridiculously large dataframe with a much smaller one, and I get an OverflowError:
df = df.merge(slave, left_on='buyer', right_on='NAME', how='left')
OverflowError: Python int too large to convert to C long
RAM is at 56% usage before the merge. Am I hitting some limitation here?
The master dataframe:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80162624 entries, 0 to 90320839
Data columns (total 38 columns):
index int64
dtypes: datetime64[ns](2), float32(1), int64(3), object(32)
memory usage: 23.0+ GB
The dataframe I would like to merge into the master:
slave.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55394 entries, 0 to 55393
Data columns (total 6 columns):
dtypes: object(6)
memory usage: 2.5+ MB
I am using the latest Anaconda distribution (that is, with pandas 0.18.0). Thanks for your help!
About this issue
- State: open
- Created 8 years ago
- Comments: 16 (6 by maintainers)
@randomgambit no, it's good that you are pushing on this. I think there is a bug here. Just need someone to repro in a sane way 😃
you might also try your problem with https://dask.readthedocs.io/en/latest/
it is actually very well suited to an out-of-core join.
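For reference, a minimal sketch of what an out-of-core join with dask might look like. Everything here is an assumption for illustration: the file paths, the blocksize, and loading from CSV; adapt to your actual data source.

```python
import dask.dataframe as dd
import pandas as pd

# Read the big frame lazily in ~256 MB partitions instead of holding
# all 80M rows in memory at once (path and blocksize are hypothetical).
df = dd.read_csv('master.csv', blocksize='256MB')

# The small frame fits comfortably in RAM, so plain pandas is fine;
# dask joins it against each partition of the big frame in turn.
slave = pd.read_csv('slave.csv')

merged = df.merge(slave, left_on='buyer', right_on='NAME', how='left')

# Nothing runs until write time; writing one file per partition keeps
# the full result out of memory.
merged.to_csv('merged-*.csv')
```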
thank you Jeff. I appreciate your help. I hope my insanely-large-dataframe issues are helpful to the community 😉
then .map won't work for that. But you should really examine your problem. It is often much better to merge only a small part of a bigger dataframe, then bring in the other columns. Unfortunately I cannot help you any more here.
So first off, what you are trying to do is completely inefficient. Unless you are doing a multi-to-multi merge, you will probably be better off using .map.
On Windows this is very likely to blow up, as int32 is used for pointers. I don't really know specifically why this is blowing up, as I can't allocate that much memory. It is indexing past int32, which normally is not a problem (for numpy), but the failing line is a Python line. So I am not sure of the issue.
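For what it's worth, the .map alternative mentioned above might look like the sketch below. It assumes you only need a single column from slave and that NAME is unique there; 'CITY' is a made-up column name.

```python
# Build a key -> value lookup Series from the small frame
# ('CITY' is hypothetical; NAME is assumed unique in slave).
lookup = slave.set_index('NAME')['CITY']

# Equivalent to the how='left' merge for a single column:
# rows with no matching NAME get NaN.
df['CITY'] = df['buyer'].map(lookup)
```

This sidesteps the join machinery entirely and only ever allocates one new column at a time.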
You have lots of object dtypes. You need to make sure that these actually hold strings (and NOT other objects, e.g. an embedded integer).
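One quick way to audit that (a sketch, nothing here is specific to this issue's data) is to count the Python types actually stored in each object column:

```python
# Report any object column whose non-null values are not all str
# (an embedded int, for example, shows up here).
for col in df.select_dtypes(include='object'):
    types = df[col].dropna().map(type).value_counts()
    if list(types.index) != [str]:
        print(col, types.to_dict())
        # one possible fix (note this also stringifies NaN):
        # df[col] = df[col].astype(str)
```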
Further, you should categorize things to use less memory. Try merging a smaller frame (e.g. fewer columns), or simply get more memory.
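Concretely, both suggestions could look something like this. Column names other than buyer/NAME are invented, and it assumes NAME is unique in slave so the left merge preserves df's row count:

```python
# Repeated strings in low-cardinality columns are far cheaper as
# categories ('country' is a hypothetical column name).
df['country'] = df['country'].astype('category')

# Merge only the key column instead of dragging all 38 columns
# through the join, then attach the new columns back positionally.
thin = df[['buyer']].merge(slave, left_on='buyer', right_on='NAME', how='left')
for col in slave.columns.drop('NAME'):
    df[col] = thin[col].values
```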
Please show the full traceback.
As a start, show the dtypes of the merging columns, plus a data sample, and the output of pd.show_versions().
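That is, a minimal diagnostic post along these lines (the column choices simply mirror the merge call above):

```python
import pandas as pd

print(df['buyer'].dtype, slave['NAME'].dtype)  # dtypes of the merge keys
print(df['buyer'].head())                      # small sample of each key
print(slave['NAME'].head())
pd.show_versions()                             # pandas/numpy/OS details
```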