arrow: [Python] OSError: List index overflow.
Hello,
I am storing pandas DataFrames as .parquet files with DataFrame.to_parquet and then trying to load them back with pd.read_parquet. I am running into an error that I cannot find a solution for, and I would appreciate help solving it.
Here is the trace:
File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
return impl.read(
File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
result = self.api.parquet.read_table(
File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: List index overflow.
If I store a small DataFrame, I do not get this error. If I store a larger DataFrame, e.g. one with 295,912,999 rows, I do.
However, before saving it I print the index range, and it is bounded to 0..295912998. Saving the .parquet with index=True or index=False gives the same error, and I do not understand why there would be an overflow on a bounded index.
Any hints are much appreciated, thanks!
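A possible reading of the error, offered as a sketch rather than a confirmed diagnosis: the "index" in the message may refer to list offsets inside a column, not to the DataFrame index. Arrow's default list type stores its child offsets as signed 32-bit integers, so one list array can address at most 2**31 - 1 child elements. The post above does not say whether the frame has list-typed columns; that is an assumption taken from the later comments. The helper names below are hypothetical and only do the arithmetic for the reported row count.

```python
# Arrow's default list type uses signed 32-bit offsets, so a single list
# array can address at most 2**31 - 1 child elements.
INT32_MAX = 2**31 - 1


def max_avg_list_length(n_rows: int) -> float:
    """Largest average list length that still fits in 32-bit offsets."""
    return INT32_MAX / n_rows


def overflows_int32_offsets(n_rows: int, avg_list_length: float) -> bool:
    """True if the total number of list elements exceeds the int32 range."""
    return n_rows * avg_list_length > INT32_MAX


# Row count reported in the post; the list lengths are assumed values.
n_rows = 295_912_999
print(f"{max_avg_list_length(n_rows):.2f}")   # 7.26
print(overflows_int32_offsets(n_rows, 7))     # False
print(overflows_int32_offsets(n_rows, 8))     # True
```

Under that assumption, a list column averaging 8 or more elements per row would not fit in a single 32-bit-offset list array at this row count, which would be consistent with small frames loading fine and large ones failing.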
About this issue
- State: open
- Created 2 years ago
- Reactions: 2
- Comments: 22 (14 by maintainers)
FWIW I’ve had success using polars to read large parquet files containing columns of lists of float64s, converting them to pandas afterwards, when pd.read_parquet caused an error.
Managed to reproduce this error from a dataset with a single column containing a list of integers. Tested with pyarrow-9.0.0 and pandas-1.5. Loading the dataset failed with OSError: List index overflow.
NOTE: loading the dataset leads to an “out of memory kill” on a machine with 128G RAM. I have to test it on a 256G RAM machine.
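For anyone wanting to try the polars route mentioned above, here is a minimal sketch. The function name read_parquet_via_polars is hypothetical, and the approach assumes polars, pandas and pyarrow are all installed. It has not been checked against the dataset from this issue.

```python
from pathlib import Path


def read_parquet_via_polars(path):
    """Read a parquet file with polars and return a pandas DataFrame."""
    # Imported lazily so this module loads even without polars installed.
    import polars as pl

    # By default polars reads parquet with its own reader, not pyarrow's.
    frame = pl.read_parquet(Path(path))
    # to_pandas() relies on pandas and pyarrow for the conversion.
    return frame.to_pandas()
```

Usage would be something like df = read_parquet_via_polars("data.parquet"), where the file name is a placeholder. The conversion to pandas still has to hold the whole table in memory, so this does not help with the out-of-memory kill noted above.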