pandas: Pandas pytables interface doesn't create empty table datasets

Pandas used to allow writing empty HDF5 datasets through its pytables interface code. However, after upgrading from 0.11 to 0.17, we’ve discovered that this behaviour is intentionally short-circuited. The library behaves as though the dataset is being written, but silently ignores the request, and the resulting HDF5 file doesn’t contain the requested table.

The offending code is in pandas/io/pytables.py:_write_to_group():

    # we don't want to store a table node at all if are object is 0-len
    # as there are not dtypes
    if getattr(value, 'empty', None) and (format == 'table' or append):
        return

We’ve worked around it by patching our installed copy of pandas, but we’d like to know the rationale behind this code before submitting a pull request. The comment implies that the lack of dtypes in the dataset is the cause; however, each pandas column carries type information even when it is empty.
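
To illustrate that last point, here is a minimal sketch showing that an empty DataFrame still reports a dtype for every column. The column names and dtypes are made up for illustration, and the dtypes are set explicitly so the result does not depend on how a given pandas version infers the type of an empty list.

```python
import pandas as pd

# Hypothetical columns with explicit dtypes (assumption: any fixed-width
# dtypes would do; these two are only examples).
empty = pd.DataFrame({
    "col_1": pd.Series(dtype="int64"),
    "col_2": pd.Series(dtype="float64"),
})

# The frame has no rows...
assert empty.empty

# ...but each column still has a dtype.
dtype_names = {name: str(dtype) for name, dtype in empty.dtypes.items()}
print(dtype_names)  # {'col_1': 'int64', 'col_2': 'float64'}
```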

Any clarification would be appreciated.

About this issue

State: open
Created 8 years ago
Reactions: 5
Comments: 27 (12 by maintainers)

Most upvoted comments

We’re writing a data structure that can be empty. Then we’re reading the data structure in another program. The current method silently elides the existence of the table, so the reading program would have to catch an exception and fake the data structure.
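
A rough sketch of what that reader-side fallback might look like. The helper name `read_or_empty` and its `schema` argument are hypothetical, not part of pandas; the only pandas behaviour relied on is that `store[key]` raises `KeyError` when the key is missing.

```python
import pandas as pd


def read_or_empty(store, key, schema):
    """Return store[key], or an empty DataFrame if the key was never written.

    `schema` is a hypothetical mapping of column name -> dtype that the
    reader has to know in advance, because the file carries no trace of
    the missing table.
    """
    try:
        return store[key]
    except KeyError:
        # Fake the data structure the writer intended to store.
        return pd.DataFrame(
            {name: pd.Series(dtype=dtype) for name, dtype in schema.items()}
        )
```

The reader ends up duplicating the writer's schema, which is exactly the coupling an empty table in the file would avoid.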

damionw on Apr 28, 2016

I think I’m facing the same issue.

Here’s how to reproduce:

    import pandas as pd
    from pandas import HDFStore

    # Prints 0.20.3
    print(pd.__version__)

    emptydf = pd.DataFrame({'col_1': [], 'col_2': []}, index=[])

    # The context manager closes the store on exit
    with HDFStore("test.h5", 'w') as store:

        assert not store.keys()

        # append -> no table created
        store.append('empty', emptydf)
        assert not store.keys()

        # put, 'table' format -> no table created
        store.put('empty', emptydf, format='table')
        assert not store.keys()

        # put, default format -> array created
        store.put('empty', emptydf)
        assert store.keys() == ['/empty']

My use case

I’m writing an API to store timeseries, and I would like to separate creating/deleting a timeseries ID from writing/deleting data in that timeseries.

In other words, I want to be able to do

    # Returns empty list []
    list_ids()

    # Raises "ID does not exist" exception
    save(new_id, new_data)

    # Creates new timeseries ID
    create(new_id)

    # Returns [new_id, ]
    list_ids()

    # Writes data (this time, ID exists)
    save(new_id, new_data)

but I don’t know how to create an empty timeseries, because it won’t be written to the file. I could let save auto-create the timeseries, but that wouldn’t solve the issue: the ID would not be listed until it actually contains data.

The only workaround I see is to maintain an ID list somewhere else, which I’d rather avoid.
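
One possible alternative, sketched under the assumption that the behaviour in the reproduction above still holds (put with the default format does create a node for an empty frame): create could write an empty fixed-format placeholder, and save could replace it with a table on the first real write. The function names mirror the pseudo-API above; the `store` and `columns` parameters are hypothetical, and this needs PyTables installed.

```python
import pandas as pd


def create(store, ts_id, columns):
    """Register ts_id by writing an empty placeholder in the default (fixed) format.

    `columns` is a hypothetical mapping of column name -> dtype.
    """
    if ts_id in store:
        raise ValueError("ID already exists: %s" % ts_id)
    placeholder = pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in columns.items()}
    )
    store.put(ts_id, placeholder)  # default format, so the node is created


def list_ids(store):
    """List IDs from the store itself, without a separate registry."""
    return [key.lstrip("/") for key in store.keys()]


def save(store, ts_id, data):
    """Write data to an existing ID, replacing the placeholder on first write."""
    if ts_id not in store:
        raise KeyError("ID does not exist: %s" % ts_id)
    if store.get_storer(ts_id).is_table:
        store.append(ts_id, data)
    else:
        # First real write: overwrite the fixed-format placeholder with a table.
        store.put(ts_id, data, format="table")
```

One caveat: passing an empty `data` to save is still silently skipped by pandas, so the placeholder simply stays in place.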

lafrech on Oct 30, 2017