pandas: BUG: merging on int32 platforms with large blocks

Hello everyone,

I am trying to merge a ridiculously large dataframe with a much smaller one and I get

df = df.merge(slave, left_on='buyer', right_on='NAME', how='left')

OverflowError: Python int too large to convert to C long

RAM usage is at 56% prior to the merge. Am I hitting some limitation here?

master dataframe

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80162624 entries, 0 to 90320839
Data columns (total 38 columns):
index                      int64

dtypes: datetime64[ns](2), float32(1), int64(3), object(32)
memory usage: 23.0+ GB
dataframe I would like to merge to the master

slave.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55394 entries, 0 to 55393
Data columns (total 6 columns):
dtypes: object(6)
memory usage: 2.5+ MB

I am using the latest Anaconda distribution (that is, with pandas 0.18.0). Thanks for your help!

About this issue

  • State: open
  • Created 8 years ago
  • Comments: 16 (6 by maintainers)

Most upvoted comments

@randomgambit no, it's good that you are pushing on this. I think there is a bug here. Just need someone to repro in a sane way 😃

you might also try your problem with: https://dask.readthedocs.io/en/latest/

Dask is actually very well suited to an out-of-core join.
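For reference, a minimal sketch of what that could look like with dask, assuming the data can be read from CSV (the file names are placeholders, not from the original report):

import dask.dataframe as dd

# 'master.csv' and 'slave.csv' are placeholder file names
df = dd.read_csv('master.csv')
slave = dd.read_csv('slave.csv')

# the join is built lazily and executed partition by partition,
# so the full 23 GB frame never has to be merged in one shot
merged = df.merge(slave, left_on='buyer', right_on='NAME', how='left')
merged.to_csv('merged-*.csv')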

thank you Jeff. I appreciate your help. I hope my insanely-large-dataframes-issues are helpful to the community 😉

Then .map won't work for that, but you should really examine your problem. It is often much better to simply merge a small part of the bigger dataframe, then bring in the other columns (a sketch follows below).

Unfortunately I cannot help you any more here.
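For illustration, a sketch of that merge-a-small-part approach; it assumes 'NAME' is unique in slave, so the left merge preserves df's length and row order:

# Merge only the key column of the big frame, then attach the new
# columns back by position. Assumes 'NAME' is unique in slave, so the
# left merge keeps df's length and row order.
keys = df[['buyer']].merge(slave, left_on='buyer', right_on='NAME', how='left')
for col in slave.columns.drop('NAME'):
    # .values sidesteps index alignment between the two frames
    df[col] = keys[col].values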

So first off, what you are trying to do is completely inefficient. Unless you are doing a many-to-many merge, you will probably be better off using .map.
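A minimal sketch of the .map route; 'some_col' is a hypothetical column name, and 'NAME' is assumed unique in slave:

# Build a lookup Series keyed by the join column, then map it onto the
# big frame; much cheaper than a full merge when one column is needed.
# 'some_col' is a placeholder; 'NAME' must be unique in slave.
lookup = slave.set_index('NAME')['some_col']
df['some_col'] = df['buyer'].map(lookup)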

On Windows this is very likely to blow up, as int32 is used for pointers. I don't really know specifically why this is blowing up, as I can't allocate that much memory. It is indexing past the int32 limit, which is normally not a problem (for numpy), but the failing line is a Python line, so I am not sure of the issue. The session below shows where the chunked indexer crosses that limit:

In [19]: import numpy as np

In [20]: total_len = 80162624*32

In [21]: chunk_len = max(total_len // 40, 1000)

In [22]: indexer = np.array(list(range(0, total_len, chunk_len)))

In [23]: ml = np.iinfo('int32').max

In [24]: indexer>ml
Out[24]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True,  True,
        True,  True,  True,  True,  True], dtype=bool)

In [25]: indexer
Out[25]:
array([         0,   64130099,  128260198,  192390297,  256520396,
        320650495,  384780594,  448910693,  513040792,  577170891,
        641300990,  705431089,  769561188,  833691287,  897821386,
        961951485, 1026081584, 1090211683, 1154341782, 1218471881,
       1282601980, 1346732079, 1410862178, 1474992277, 1539122376,
       1603252475, 1667382574, 1731512673, 1795642772, 1859772871,
       1923902970, 1988033069, 2052163168, 2116293267, 2180423366,
       2244553465, 2308683564, 2372813663, 2436943762, 2501073861,
       2565203960], dtype=int64)

You have lots of object dtypes. You need to make sure that these columns actually contain strings (and NOT arbitrary objects, e.g. an embedded integer).
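One quick way (a sketch) to check what an object column actually holds:

# Count the Python types stored in an object column; embedded
# ints/floats show up as extra entries here.
print(df['buyer'].map(type).value_counts())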

Further, you should categorize things to use less memory. Try doing a smaller merge (e.g. fewer columns), or simply get more memory.
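A sketch of the categorize step; the cardinality cutoff is an arbitrary heuristic, not part of the original advice:

# Convert low-cardinality object columns to 'category' to cut memory.
for col in df.select_dtypes(include=['object']).columns:
    if df[col].nunique() < 0.5 * len(df):   # arbitrary cutoff
        df[col] = df[col].astype('category')
print(df.memory_usage(deep=True).sum() / 1e9)   # rough size in GB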

pls show the full traceback

As a start, show the dtypes of the merging columns plus a data sample, and also the output of pd.show_versions().