pandas: BUG: MemoryError on reading big HDF5 files

Code Sample, a copy-pastable example if possible

import pandas as pd

store = pd.get_store('big1.h5')  # equivalent to pd.HDFStore('big1.h5')
i = 0
# Counting chunks should keep memory usage bounded, but raises MemoryError instead.
for df in store.select('/MeasurementSamples', chunksize=100):
    i += 1
print(i)
store.close()

Result:

Traceback (most recent call last):
  File "memerror.py", line 6, in <module>
    for df in store.select(experiment, chunksize=100):
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 721, in select
    return it.get_result()
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 1409, in get_result
    self.coordinates = self.s.read_coordinates(where=self.where)
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 3652, in read_coordinates
    coords = self.selection.select_coords()
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 4718, in select_coords
    return np.arange(start, stop)
MemoryError
Closing remaining open files:big1.h5...done
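The failing frame shows where the memory goes: before yielding any chunks, pandas builds the complete row-coordinate array for the table with np.arange(start, stop). A rough back-of-the-envelope estimate (my own numbers, assuming the four 8-byte columns from the writer schema below and ignoring compression, which would only increase the row count) shows why that single allocation can fail for a 20 GB file:

row_bytes = 4 * 8                  # tsf, timestamp, frequency, power: four 8-byte columns
nrows = 20 * 1024**3 // row_bytes  # ~670 million rows in 20 GB of raw data
index_bytes = nrows * 8            # np.arange(0, nrows) materializes int64 coordinates
print(index_bytes / 1024**3)       # ~5.0 GiB just for the coordinate array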

Problem description

I’m not able to iterate over the chunks of the file when the index array is too big to fit into memory. I can also mention that I’m able to view the data with ViTables (which uses PyTables internally to load the data).
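For what it’s worth, driving the chunking by hand with the start/stop arguments of select appears to sidestep the failing read_coordinates path, because pandas then reads each row range directly instead of materializing coordinates for the whole table. A minimal sketch, assuming the same file and key as the example above:

import pandas as pd

store = pd.HDFStore('big1.h5')
key = '/MeasurementSamples'
nrows = store.get_storer(key).nrows  # total row count, read from the table metadata
chunksize = 100000
for start in range(0, nrows, chunksize):
    df = store.select(key, start=start, stop=start + chunksize)
    # process df here
store.close()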

I’m using more or less the following code to create the file (writing to it long enough to accumulate 20 GB of data).

import tables as tb

class FreqSample(tb.IsDescription):
    tsf = tb.Int64Col(dflt=-1)  # [us] TSF value, ticks in microseconds
    timestamp = tb.Int64Col()   # [ns] Epoch time
    frequency = tb.Float64Col()
    power = tb.Float64Col()

fname = 'big1.h5'            # placeholder; matches the reading example
title = 'Measurement samples'

h5filters = tb.Filters(complib='blosc', complevel=5)
h5file = tb.open_file(fname, mode="a", title=title, filters=h5filters)
tab = h5file.create_table('/Measurement', 'a', FreqSample,
                          createparents=True)  # create the parent group if missing

try:
    while True:  # append rows until interrupted
        row = tab.row
        row['tsf'] = 1
        row['timestamp'] = 2
        row['frequency'] = 3
        row['power'] = 4
        row.append()
except KeyboardInterrupt:  # a bare except would also swallow real errors
    pass
tab.autoindex = True
tab.flush()
h5file.close()
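Consistent with the ViTables observation, reading the table directly with PyTables in bounded slices also works, since it never builds a global coordinate array. A sketch, assuming the node path the reading example selects; Table.read(start, stop) returns a NumPy structured array:

import tables as tb

with tb.open_file('big1.h5', mode='r') as h5:
    tab = h5.get_node('/MeasurementSamples')
    for start in range(0, tab.nrows, 100000):
        chunk = tab.read(start=start, stop=start + 100000)
        # process chunk here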

Expected Output

I would expect the above code to print the number of chunks.

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.10-040910-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0+739.g7b82e8b
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.8
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None

About this issue

  • Original URL
  • State: open
  • Created 7 years ago
  • Comments: 17 (14 by maintainers)

Most upvoted comments

Are there any solutions to this?