arrow: [Python] OSError: List index overflow.

Hello,

I am storing pandas dataframes as .parquet files with DataFrame.to_parquet and then trying to load them back with pd.read_parquet. I am hitting an error that I cannot find a solution for, and would appreciate help resolving it.

Here is the trace:

  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/gnlzm/miniconda3/envs/antidoto/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: List index overflow.

If I store a small dataframe, I do not get this error. If I store a larger dataframe, e.g. one with 295,912,999 rows, I do.

Before saving, I printed the index range, and it is bounded by 0 and 295912998. Saving the .parquet file with index=True or index=False gives the same error, and I do not understand why a bounded index would overflow.

Any hints are much appreciated, thanks!

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 2
  • Comments: 22 (14 by maintainers)

Most upvoted comments

FWIW, I’ve had success using polars to read large Parquet files containing columns of lists of float64s and then converting them to pandas, in cases where pd.read_parquet raised an error.

Managed to reproduce this error with a dataset that has a single column containing lists of integers.

  • to generate the dataset
import numpy as np
import pandas as pd

# total rows < max(int32)
n_rows = 108000000

# dataframe has only one column containing a list of 200 integers
# 200 * n_rows > max(int32)
data = [np.zeros(200, dtype='int8')] * n_rows

print('generating...')
df = pd.DataFrame()
# only one column
df['a'] = data

print('saving ...')
df.to_parquet('/tmp/pq')
print('done')
  • to load the dataset
import pandas as pd

print('loading...')
df = pd.read_parquet('/tmp/pq', use_threads=False)
print('size = {}'.format(df.shape))

Tested with pyarrow-9.0.0 and pandas-1.5. Loading the dataset failed with OSError: List index overflow.

NOTE: loading the dataset triggers an “out of memory kill” on a machine with 128G RAM, so I had to test it on a machine with 256G RAM.
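The comments in the generation script point at the likely trigger: the row count fits in a signed 32-bit integer, but the total number of list elements does not. Arrow's regular list type stores its offsets as 32-bit integers, so the arithmetic below is a plausible explanation for the overflow. It is an inference from this reproduction, not a confirmed diagnosis.

```python
# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1

# Values taken from the reproduction script above.
n_rows = 108_000_000
list_len = 200

# Total number of list elements across the whole column.
total_values = n_rows * list_len

print("rows fit in int32:        ", n_rows <= INT32_MAX)
print("list elements fit in int32:", total_values <= INT32_MAX)
print("total list elements:       ", total_values)
```

Under that assumption, a dataframe can fail to load even when its index is well within bounds, because the overflow concerns the flattened list values rather than the row index.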