pandas: Pivot / unstack on large data frame does not work: int32 overflow

Code Sample, a copy-pastable example if possible

predictor_purchases_p = predictor_purchases.groupby(["ag", "artikelnr"])["som"].max().unstack().fillna(0)

or

predictor_purchases_p = predictor_purchases.pivot(index="ag", columns="artikelnr", values="som")

Problem description

I’m working on rather large data (>100GB in memory) on a beefy server (3TB RAM). When refactoring my code from pandas 0.21 to the latest version, the pivot / unstack now raises an exception.

Unstacked DataFrame is too big, causing int32 overflow

I was able to eliminate the problem by modifying reshape.py: on line 121, change dtype np.int32 to dtype np.int64: num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
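For context, here is a minimal sketch of what the check around that line plausibly does (an assumption based on the error message, not a copy of the pandas source): the cell count is accumulated in int32, wraps negative for large frames, and trips the size check.

import numpy as np

# Sketch only: assumed shape of the guard in reshape.py, not the actual
# pandas source.
def check_unstack_size(num_rows, num_columns):
    # The product is computed in int32, so frames with more than
    # 2**31 - 1 cells wrap around to a negative value ...
    num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
    if num_rows > 0 and num_columns > 0 and num_cells <= 0:
        # ... which is what raises the error quoted above.
        raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")

# Computing num_cells with dtype=np.int64 keeps the product exact for
# frames of this size, which is the one-line change described above.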

Expected Output

Not being limited to int32 dimensions when reshaping a data frame. This feels like a restriction that should not be there.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: 0.3.0
pandas_gbq: None
pandas_datareader: None
gcsfs: None

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 38
  • Comments: 47 (13 by maintainers)

Most upvoted comments

The point I’m trying to raise is: why is the number of cells limited to the max value of np.int32?

num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)

This creates constraints when working with large data frames. Basically, I’m proposing to change this to np.int64.

My current dataset has RangeIndex: 2584251 entries, Columns: 4539 entries.

num_cells = 2584251 * 4539 = 11,729,915,289 cells

So I have 11.7 B cells.

Putting an int32 constraint on the number of cells is way too small for my datasets. I’m quite sure this causes problems for other users.
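To make the arithmetic concrete, a minimal sketch of how the int32 product wraps around for the dimensions reported above:

import numpy as np

num_rows, num_columns = 2584251, 4539

# True cell count, computed in 64 bits: 11,729,915,289.
print(np.multiply(num_rows, num_columns, dtype=np.int64))

# The same product accumulated in 32 bits wraps around and comes out
# negative (NumPy may also emit an overflow RuntimeWarning here), which is
# what makes the size check conclude the frame is "too big".
print(np.multiply(num_rows, num_columns, dtype=np.int32))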

Any news on this? I’m failing to roll back to 0.21. My next course of action is to rewrite the vectorised function as a loop, but I’d rather not.

I’m getting this error at 6 GB of memory use when pivoting the MovieLens large ratings.csv. So I agree that in this day and age we need int64.

Yeah, I’m bumping into this as well. I’m trying to make 2.87 billion cells in my unstack. I saw in issue #20601 that the conclusion was basically that throwing a more informative error made more sense than increasing from int32 to int64. I kind of agree with that assessment. It would be nice to have an option in unstack that would let you tell it that you expect a very large number of cells and switch to an int64 index.

I’m not sure how difficult this would be or if it would be worth it to satisfy people with long lists of product-user pairs like me.

Some suggestions were to downgrade to pandas==0.21, which is not really a feasible solution!

I faced the same issue and needed an urgent fix for the unexpected int32 overflow. One of our recommendation models was running in production, and at some point the user base increased to more than 7 million records with around 21k items.

So, to solve the issue I chunked the dataset as mentioned by @igorkf, created the pivot table using unstack, and appended to it gradually.

import pandas as pd 
from tqdm import tqdm

chunk_size = 50000
# chunk start offsets; also record the end of the frame so the final
# partial chunk is not dropped
chunks = [x for x in range(0, df.shape[0], chunk_size)]
chunks.append(df.shape[0])

# preview the chunk boundaries (inclusive end shown for readability)
for i in range(0, len(chunks) - 1):
    print(chunks[i], chunks[i + 1] - 1)

# 0 49999
# 50000 99999
# 100000 149999
# 150000 199999
# 200000 249999
# .........................



pivot_df = pd.DataFrame()

for i in tqdm(range(0, len(chunks) - 1)):
    # iloc's end is exclusive, so slicing up to chunks[i + 1] keeps every row
    chunk_df = df.iloc[chunks[i]:chunks[i + 1]]
    interactions = (
        chunk_df.groupby([user_col, item_col])[rating_col]
        .sum()
        .unstack()
        .reset_index()
        .fillna(0)
        .set_index(user_col)
    )
    print(interactions.shape)
    # DataFrame.append still exists in the pandas versions discussed here;
    # it was removed in pandas 2.0, where pd.concat is the replacement
    pivot_df = pivot_df.append(interactions, sort=False)
And then I have to make a sparse matrix as input to the LightFM recommendation model (which runs a matrix-factorization algorithm). You can use this for any use case where unstacking is required. Using the following code, I converted it to a sparse matrix:

from scipy import sparse

# convert the assembled pivot table to a CSR sparse matrix
sparse_matrix = sparse.csr_matrix(pivot_df.to_numpy())

NB: pandas has a pivot_table function which can be used instead of unstacking if your data is small. In my case, pivot_table was really slow.
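For reference, a minimal sketch of that alternative, reusing the assumed column-name variables (user_col, item_col, rating_col) from the chunked example above; aggfunc="sum" mirrors the .sum() used there:

# Small-data alternative to the chunked unstack above.
small_pivot = df.pivot_table(
    index=user_col,
    columns=item_col,
    values=rating_col,
    aggfunc="sum",
    fill_value=0,
)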

Still open, still looking for a volunteer to fix it.

@KaonToPion I’m also waiting on this fix. If you could please give it another try, that would be great. Thank you!

I have been stuck on this for some days and finally fixed it by changing int32 to int64. Would it be all right to send a pull request with it?

I got this error in version 0.25.3. Any news on it?

I get “ValueError: negative dimensions are not allowed” after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64).

Any chance we have a different workaround? I only have 6000 columns…

Turns out I get the same error even after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64), although the error definitely looks like it occurs there.