astropy: crash/bug when reading table meta-data containing object array written with serialize_meta=True to HDF5 file

Description

Please see the code below, which crashes unexpectedly after the first run, i.e. typically on the second run. I’m puzzled by the behavior. Note that the code doesn’t crash if a) bands is a byte string array, b) serialize_meta is False, or c) a table is written with serialize_meta=True before the table is read (that’s why the first run doesn’t crash).

I’m sorry if what I’m doing is not supported. In any case, that the code just crashes with a segmentation fault is puzzling. I’ve run the code on 2 machines and 3 different virtual environments and they all crashed eventually.

Expected behavior

No crash and consistent output, i.e. OrderedDict([('bands', array(['g', 'r', 'i'], dtype=object))]), when running the code repeatedly.

How to Reproduce

I ran the following code in a virtual environment on Fedora 38 with python 3.11 and packages installed via pip. Here’s the minimal requirements.txt.

astropy==5.2.2
h5py==3.8.0
numpy==1.24.3
packaging==23.1
pyerfa==2.0.0.3
PyYAML==6.0

The following code crashes when the script is run a second time. In other words, if I run the script once, it works fine and outputs the expected result. But when I run it again, it typically crashes with a segmentation fault. Sometimes the second and later runs give a wrong result instead, e.g. OrderedDict([('bands', array(['R', 'r', 'i'], dtype=object))]).

from astropy.table import Table
import numpy as np
from pathlib import Path

table = Table(np.arange(5))
table.meta['bands'] = np.array(['g', 'r', 'i'], dtype=object)

path = Path('test.hdf5')
if not path.exists():
    table.write(path, serialize_meta=True)
print(Table.read(path).meta)

Versions

Linux-6.2.14-300.fc38.x86_64-x86_64-with-glibc2.37
Python 3.11.3 (main, Apr 5 2023, 00:00:00) [GCC 13.0.1 20230401 (Red Hat 13.0.1-0)]
astropy 5.2.2
Numpy 1.24.3
pyerfa 2.0.0.3

About this issue

  • State: closed
  • Created a year ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

Ok, since the beginning I was wondering how it’s possible to dump objects. It seems to work:

❯ ipython 

In [1]: import numpy as np
   ...: from astropy.io.misc import yaml

In [2]: a = np.array(['g', 'r', 'i'], dtype=object)

In [3]: yaml.dump(a)
Out[3]: '!numpy.ndarray\nbuffer: !!binary |\n  OERnRXg3MVZBQUFBaGdESHZWVUFBT0NKQWNlOVZRQUE=\ndtype: object\norder: C\nshape: !!python/tuple [3]\n'

In [4]: yaml.load(yaml.dump(a))
Out[4]: array(['g', 'r', 'i'], dtype=object)

But this must be because the binary buffer above contains only the memory addresses of the objects being dumped. So once you exit the Python session (or run the test script in a new shell) and try to reload it, it crashes, because those memory addresses no longer exist (or actually point to something else):

❯ ipython

In [1]: import numpy as np
   ...: from astropy.io.misc import yaml

In [2]: yaml.load('!numpy.ndarray\nbuffer: !!binary |\n  OERnRXg3MVZBQUFBaGdESHZWVUFBT0NKQWNlOVZRQUE=\ndtype: object\norder: C\nshape: !!python/tuple [3]\n')
Out[2]: [1]    301404 segmentation fault (core dumped)  ipython

So the yaml dumper should not allow dumping object arrays.
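
To illustrate the point about memory addresses, here is a minimal sketch that inspects the raw bytes of an object array. It assumes CPython (where id() returns an object's address) and that the buffer in the YAML dump corresponds to what tobytes() returns; the variable names are only for illustration.

```python
import numpy as np

# An object array stores references (pointers) to Python objects,
# not the characters of the strings themselves.
a = np.array(['g', 'r', 'i'], dtype=object)

# Raw bytes of the array: one pointer-sized slot per element.
raw = a.tobytes()

# Reinterpret those bytes as pointer-sized unsigned integers.
addresses = np.frombuffer(raw, dtype=np.uintp).tolist()

# Assumption: CPython, where id(obj) is the memory address of obj.
print(addresses == [id(x) for x in a])

# The buffer holds only pointers, so its size is independent of string length.
print(len(raw) == a.size * a.itemsize)
```

On CPython both comparisons should print True. The addresses are only meaningful inside the process that created them, which would explain why reloading the buffer in a fresh session crashes or returns unrelated objects.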

@saimn - brilliant!!

So I think the fix is to check for an object dtype in these places:
https://github.com/astropy/astropy/blob/d85ce4bb3e8961660857b21618cacb80c267ca91/astropy/io/misc/yaml.py#L112-L113
https://github.com/astropy/astropy/blob/d85ce4bb3e8961660857b21618cacb80c267ca91/astropy/io/misc/yaml.py#L134
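
A check of this kind could look roughly like the sketch below. This is not the actual astropy patch; check_no_object_dtype is a hypothetical helper name, and the only NumPy feature it relies on is the dtype.hasobject attribute.

```python
import numpy as np


def check_no_object_dtype(arr):
    """Hypothetical guard: refuse arrays whose dtype contains Python objects.

    dtype.hasobject is also True for structured dtypes with object fields.
    """
    arr = np.asarray(arr)
    if arr.dtype.hasobject:
        raise TypeError(
            f"cannot serialize array with dtype {arr.dtype!r}: "
            "object arrays store memory addresses, not data"
        )
    return arr


# String and byte-string arrays pass through unchanged.
check_no_object_dtype(np.array(['g', 'r', 'i']))
check_no_object_dtype(np.array([b'g', b'r', b'i']))

# An object array is rejected before anything is written.
try:
    check_no_object_dtype(np.array(['g', 'r', 'i'], dtype=object))
except TypeError as exc:
    print(f"rejected: {exc}")
```

Raising at dump time turns a delayed segmentation fault into an immediate, explainable error.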

For @johannesulf unfortunately this will cause the original code to raise an exception saying (basically) “don’t try to serialize an object array”.
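
One possible way to avoid the problem on the user side is to convert object arrays in the meta-data to plain lists before writing. The sketch below is only a suggestion under that assumption; sanitize_meta is a hypothetical helper, not part of astropy, and the values come back as lists rather than arrays.

```python
import numpy as np


def sanitize_meta(meta):
    """Hypothetical helper: replace object arrays in a meta dict by plain lists."""
    clean = {}
    for key, value in meta.items():
        if isinstance(value, np.ndarray) and value.dtype.hasobject:
            # tolist() yields the actual Python objects, not their addresses.
            value = value.tolist()
        clean[key] = value
    return clean


meta = {'bands': np.array(['g', 'r', 'i'], dtype=object)}
print(sanitize_meta(meta))  # {'bands': ['g', 'r', 'i']}
```

The result could then be assigned with table.meta.update(sanitize_meta(table.meta)) before calling table.write. As noted in the description, a byte string array for bands also avoids the crash.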

I’m a bit busy with other priorities right now, but this shouldn’t be too hard of a fix.

@pllim Weirdly enough, it doesn’t crash when running it this way. It always correctly prints the meta-information.