pandas: read_hdf throws UnicodeDecodeError with Python 3.5 and 3.6 but not with Python 2.7

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.read_hdf('data.h5')

Problem description

The HDF5 dataset was created with pandas, to_hdf in Python 2.7 and can be read in by Python 2.7. When I try to read it in with Python 3.5 or Python 3.6, I get the following:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-53006689fd2c> in <module>()
----> 1 df = pd.read_hdf(data.h5')

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
    356                                      'contains multiple datasets.')
    357             key = candidate_only_group._v_pathname
--> 358         return store.select(key, auto_close=auto_close, **kwargs)
    359     except:
    360         # if there is an error, close the store

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    720                            chunksize=chunksize, auto_close=auto_close)
    721 
--> 722         return it.get_result()
    723 
    724     def select_as_coordinates(

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
   1426 
   1427         # directly return the result
-> 1428         results = self.func(self.start, self.stop, where)
   1429         self.close()
   1430         return results

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
    713             return s.read(start=_start, stop=_stop,
    714                           where=_where,
--> 715                           columns=columns, **kwargs)
    716 
    717         # create the iterator

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, start, stop, **kwargs)
   2864             blk_items = self.read_index('block%d_items' % i)
   2865             values = self.read_array('block%d_values' % i,
-> 2866                                      start=_start, stop=_stop)
   2867             blk = make_block(values,
   2868                              placement=items.get_indexer(blk_items))

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
   2413         import tables
   2414         node = getattr(self.group, key)
-> 2415         data = node[start:stop]
   2416         attrs = node._v_attrs
   2417 

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
    673             start, stop, step = self._process_range(
    674                 key.start, key.stop, key.step)
--> 675             return self.read(start, stop, step)
    676         # Try with a boolean or point selection
    677         elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
    813         atom = self.atom
    814         if not hasattr(atom, 'size'):  # it is a pseudo-atom
--> 815             outlistarr = [atom.fromarray(arr) for arr in listarr]
    816         else:
    817             # Convert the list to the right flavor

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
    813         atom = self.atom
    814         if not hasattr(atom, 'size'):  # it is a pseudo-atom
--> 815             outlistarr = [atom.fromarray(arr) for arr in listarr]
    816         else:
    817             # Convert the list to the right flavor

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
   1226         if array.size == 0:
   1227             return None
-> 1228         return six.moves.cPickle.loads(array.tostring())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)

Note: We receive a lot of issues on our GitHub tracker, so it is very possible that your issue has been posted before. Please check first before submitting so that we do not have to handle and close duplicates!

Note: Many problems can be resolved by simply upgrading pandas to the latest version. Before submitting, please check if that solution works for you. If possible, you may want to check if master addresses this issue, but that is not necessary.

For documentation-related issues, you can check the latest versions of the docs on master here:

https://pandas-docs.github.io/pandas-docs-travis/

If the issue has not been resolved there, go ahead and file it in the issue tracker.

Expected Output

In [1]: import pandas as pd
In [2]: df = pd.read_hdf('data.h5')

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 3
  • Comments: 17 (7 by maintainers)

Most upvoted comments

Just a postscript. format='table' only works for a single column of data. When trying to save the entire dataset in Python 2.7,

TypeError: Cannot serialize the column [task_list] because
its data contents are [unicode] object dtype

when saving using encoding='utf-8' the file is saved but again cannot be read in 3.x. TypeError: lookup() argument must be str, not numpy.bytes_