pandas: Pivot / unstack on large data frame does not work: int32 overflow
Code Sample, a copy-pastable example if possible
predictor_purchases_p = predictor_purchases.groupby(["ag", "artikelnr"])["som"].max().unstack().fillna(0)
or
predictor_purchases_p = predictor_purchases.pivot(index="ag", columns="artikelnr", values="som")
Problem description
I’m working on rather large data (>100GB in memory) on a beefy server (3TB of RAM). When refactoring my code from pandas 0.21 to the latest version, the pivot / unstack now raises an exception:
Unstacked DataFrame is too big, causing int32 overflow
I was able to eliminate the problem by changing reshape.py, modifying line 121 from dtype np.int32 to dtype np.int64:
num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
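For illustration, a minimal sketch (using the row and column counts of my dataset, described in a comment below) of why the dtype makes the difference: with int32 the product wraps around to a negative number, which is what trips the overflow check, while int64 holds the true count.

```python
import numpy as np

num_rows, num_columns = 2_584_251, 4_539    # roughly the shape of my unstacked frame

# int32: the ~11.7 billion product wraps around and goes negative,
# which is what triggers the "int32 overflow" ValueError in reshape.py
print(np.multiply(num_rows, num_columns, dtype=np.int32))

# int64: the product fits, so the check passes
print(np.multiply(num_rows, num_columns, dtype=np.int64))   # 11729915289
```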
Expected Output
Not being limited by int32 dimensions when reshaping a data frame. This feels like a restriction that should not be there.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.7.3.final.0 python-bits: 64 OS: Linux OS-release: 3.10.0-862.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.24.2 pytest: None pip: 19.1 setuptools: 41.0.1 Cython: 0.29.7 numpy: 1.16.3 scipy: 1.2.1 pyarrow: None xarray: 0.12.1 IPython: 7.5.0 sphinx: None patsy: 0.5.1 dateutil: 2.8.0 pytz: 2019.1 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: 3.0.3 openpyxl: 2.6.2 xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: 1.3.3 pymysql: None psycopg2: None jinja2: 2.10.1 s3fs: None fastparquet: 0.3.0 pandas_gbq: None pandas_datareader: None gcsfs: None
About this issue
- State: closed
- Created 5 years ago
- Reactions: 38
- Comments: 47 (13 by maintainers)
The point I’m trying to raise is: why is the number of cells limited to the maximum value of np.int32?
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
This creates constraints when working with large data frames. Basically, I’m proposing to change this to np.int64.
My current dataset has RangeIndex: 2,584,251 entries and Columns: 4,539 entries.
num_cells = 2,584,251 * 4,539 = 11,729,915,289 cells
So I have 11.7 billion cells.
Putting an int32 constraint on the number of cells is way too small for my datasets. I’m quite sure this causes problems for other users too.
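A quick pre-flight check (a sketch reusing the frame and column names from my example above) shows how far over the limit this is:

```python
import numpy as np

int32_max = np.iinfo(np.int32).max                    # 2,147,483,647

n_rows = predictor_purchases["ag"].nunique()          # unique index values
n_cols = predictor_purchases["artikelnr"].nunique()   # unique column values

print(n_rows * n_cols)              # 11,729,915,289, roughly 5.5x the int32 limit
print(n_rows * n_cols > int32_max)  # True
```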
Any news on this? I’m failing to roll back to 0.21. My next course of action is to rewrite the vectorised function as a loop, but I’d rather not.
I’m getting this error at about 6 GB of memory use when pivoting the MovieLens large ratings.csv. So I agree that in this day and age we need int64.
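As a workaround for user-item data like that, one option (just a sketch, assuming the standard MovieLens userId / movieId / rating columns and that a sparse result is acceptable downstream) is to skip the dense pivot entirely and build a scipy.sparse matrix from categorical codes:

```python
import pandas as pd
from scipy.sparse import csr_matrix

ratings = pd.read_csv("ratings.csv")         # userId, movieId, rating, timestamp

users = ratings["userId"].astype("category")
items = ratings["movieId"].astype("category")

# one stored value per observed (user, item) pair; the dense matrix with
# billions of cells is never materialised, so the int32 limit never applies
mat = csr_matrix(
    (ratings["rating"].to_numpy(), (users.cat.codes, items.cat.codes)),
    shape=(users.cat.categories.size, items.cat.categories.size),
)
```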
Yeah, I’m bumping into this as well. I’m trying to create 2.87 billion cells in my unstack. I saw in issue #20601 that throwing a more informative error was basically considered to make more sense than increasing from int32 to int64, and I kind of agree with that assessment. It would be nice to have an option in unstack that lets you tell it you expect a very large number of cells, so it switches to an int64 index.
I’m not sure how difficult this would be or if it would be worth it to satisfy people with long lists of product-user pairs like me.
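Until something like that exists, a do-it-yourself sketch can place the values with int64 positions and bypass pandas' reshape path altogether. unstack_int64 below is a made-up helper, and it assumes the (row, column) pairs are unique (e.g. after a groupby aggregation) and that a float result is fine:

```python
import numpy as np
import pandas as pd

def unstack_int64(s, fill_value=0.0):
    """Unstack a Series with a 2-level MultiIndex into a dense DataFrame.

    Row/column positions come from pd.factorize (int64 codes), so the
    int32 cell-count check in pandas' unstack is never involved.
    """
    row_codes, row_labels = pd.factorize(s.index.get_level_values(0))
    col_codes, col_labels = pd.factorize(s.index.get_level_values(1))
    out = np.full((len(row_labels), len(col_labels)), fill_value, dtype="float64")
    out[row_codes, col_codes] = s.to_numpy()
    return pd.DataFrame(out, index=row_labels, columns=col_labels)

# e.g. unstack_int64(df.groupby(["user", "item"])["value"].max())
```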
Some suggestions were to downgrade to pandas==0.21, which is not really a feasible solution!
I faced the same issue and needed an urgent fix for the unexpected int32 overflow. One of our recommendation models was running in production, and at some point the user base grew to more than 7 million records with around 21k items.
So, to solve the issue I chunked the dataset as mentioned by @igorkf, created the pivot table using unstack, and appended the pieces gradually.
NB: pandas has a pivot_table function, which can be used for unstacking if your data is small. In my case, pivot_table was really slow.
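Roughly, the chunk-and-append approach looks like this (a sketch with illustrative names; the column names are borrowed from the original report, and the chunk count is something you would tune so that each slice stays well under the int32 cell limit):

```python
import numpy as np
import pandas as pd

def chunked_unstack(df, index_col, column_col, value_col, n_chunks=20):
    """Unstack in slices over the index values, then concatenate the pieces."""
    pieces = []
    for idx_chunk in np.array_split(df[index_col].unique(), n_chunks):
        part = (
            df[df[index_col].isin(idx_chunk)]
            .groupby([index_col, column_col])[value_col]
            .max()
            .unstack()
        )
        pieces.append(part)
    # concat aligns the columns across chunks; fill the gaps like the original code
    return pd.concat(pieces, sort=False).fillna(0)

# e.g. chunked_unstack(predictor_purchases, "ag", "artikelnr", "som")
```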
Still open, still looking for a volunteer to fix it.
@KaonToPion also waiting on this fix. If you could please give it another try, that would be great. Thank you!
I have been stuck with this for some days and finally I have fixed it by changing int32 to int64. Would it be all right to send a pull request with it?
I got this error in ver. 0.25.3, any news on it?
I get “ValueError: negative dimensions are not allowed” after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64).
Any chance we have a different workaround? I only have 6000 columns…
Turns out I get the same error even after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64), although the error definitely looks like it occurs there.