anndata: AnnData cannot open file that was opened with JHDF5 before

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

I noticed that AnnData throws an error when reading a .h5ad file that was previously opened with JHDF5 (with write permissions). This means that AnnData currently cannot open HDF5 files created by a Java program using JHDF5, even if the files conform to the AnnData on-disk format.

The problem most likely occurs because JHDF5 adds a /__DATA_TYPES__ group (presumably for internal reasons). This can be seen by running h5dump before and after the access from Java and comparing the outputs with diff. AnnData tries to read that group, but fails because the objects stored in it (committed datatypes such as Enum_Boolean) have no valid AnnData encoding type. I guess that this problem can be circumvented by making AnnData only read groups that are part of its on-disk schema, i.e., X, layers, uns, obs[m|p], var[m|p].
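To illustrate the idea of reading only schema groups, here is a minimal sketch that separates the top-level names of a file into those belonging to the AnnData schema and everything else. The helper names `split_top_level_keys` and `foreign_keys_in_file` are hypothetical and not part of anndata. The set of schema names is taken from the list above, with `raw` added as an assumption.

```python
# Top-level names from the list above; "raw" is added here as an assumption.
ANNDATA_TOP_LEVEL = frozenset(
    {"X", "layers", "obs", "var", "obsm", "varm", "obsp", "varp", "uns", "raw"}
)


def split_top_level_keys(keys):
    """Partition top-level HDF5 names into (schema, foreign), preserving order."""
    schema, foreign = [], []
    for key in keys:
        (schema if key in ANNDATA_TOP_LEVEL else foreign).append(key)
    return schema, foreign


def foreign_keys_in_file(path):
    """Return the top-level names in an HDF5 file that are not in the set above."""
    import h5py  # imported lazily so the pure helper above has no dependencies

    with h5py.File(path, "r") as f:
        return split_top_level_keys(f.keys())[1]
```

For a file touched by JHDF5, `foreign_keys_in_file("test.h5ad")` would be expected to include `__DATA_TYPES__`. A reader that iterates only over the schema names would never visit that group.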

Steps to reproduce

This is a minimal Python program that fails on the last line if the file is opened from the Java side between the write and the read on the Python side.

import numpy as np
import anndata as ad

adata = ad.AnnData(np.zeros((2,2)))
adata.write('test.h5ad')

# There is a conditional error on this line:
# * without external interference: works
# * if opened with JHDF5 before: throws error
bdata = ad.read('test.h5ad')

Also, here is a Java program that causes the last line to fail. Note that it doesn’t explicitly write anything to the file; it just opens it with write permissions.

import ch.systemsx.cisd.hdf5.HDF5Factory;

public class App {
	public static void main(String[] args) {
		HDF5Factory.open("test.h5ad");
	}
}

To get the necessary dependencies and execute the Java file, I suggest using Maven with this pom.xml:

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.scijava</groupId>
    <artifactId>pom-scijava</artifactId>
    <version>32.0.0-beta-5</version>
    <relativePath />
  </parent>

  <groupId>mwe</groupId>
  <artifactId>mwe</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>cisd</groupId>
      <artifactId>jhdf5</artifactId>
    </dependency>
  </dependencies>

  <repositories>
    <repository>
      <id>scijava.public</id>
      <url>https://maven.scijava.org/content/groups/public</url>
    </repository>
  </repositories>

  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.1</version>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

Traceback

Traceback (most recent call last):
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/utils.py", line 202, in func_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/specs/registry.py", line 230, in read_elem
    read_func = self.registry.get_reader(
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/specs/registry.py", line 143, in get_reader
    raise IORegistryError._from_read_parts(
anndata._io.specs.registry.IORegistryError: No read method registered for IOSpec(encoding_type='', encoding_version='') from <class 'h5py._hl.datatype.Datatype'>. You may need to update your installation of anndata.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/h5ad.py", line 243, in read_h5ad
    adata = read_dispatched(f, callback=callback)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/experimental/__init__.py", line 58, in read_dispatched
    return reader.read_elem(elem)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/utils.py", line 204, in func_wrapper
    re_raise_error(e, elem)
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/utils.py", line 185, in re_raise_error
    raise e
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/utils.py", line 202, in func_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/specs/registry.py", line 235, in read_elem
    return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/h5ad.py", line 224, in callback
    **{
      ^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/h5ad.py", line 227, in <dictcomp>
    k: read_dispatched(elem[k], callback)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/experimental/__init__.py", line 58, in read_dispatched
    return reader.read_elem(elem)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/utils.py", line 204, in func_wrapper
    re_raise_error(e, elem)
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/utils.py", line 185, in re_raise_error
    raise e
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/utils.py", line 202, in func_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/specs/registry.py", line 235, in read_elem
    return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/h5ad.py", line 241, in callback
    return func(elem)
           ^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/specs/methods.py", line 94, in read_basic
    return {k: _reader.read_elem(v) for k, v in elem.items()}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/specs/methods.py", line 94, in <dictcomp>
    return {k: _reader.read_elem(v) for k, v in elem.items()}
               ^^^^^^^^^^^^^^^^^^^^
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/utils.py", line 204, in func_wrapper
    re_raise_error(e, elem)
  File "/home/innerbergerm@hhmi.org/Software/mambaforge/envs/anndata/lib/python3.11/site-packages/anndata/_io/utils.py", line 188, in re_raise_error
    raise AnnDataReadError(
anndata._io.utils.AnnDataReadError: Above error raised while reading key '/__DATA_TYPES__/Enum_Boolean' of type <class 'h5py._hl.datatype.Datatype'> from /.

Versions

anndata             0.9.1
session_info        1.0.0
-----
Python 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
Linux-5.19.0-46-generic-x86_64-with-glibc2.35
-----
Session information updated at 2023-07-11 05:25

About this issue

  • State: closed
  • Created a year ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

I suppose we could just manually delete the group for the AnnData h5’s we create

It looks like some other users of this library do that:

But searching for "hdf5 __DATA_TYPES__" mostly turns up people trying to work around this group being added.
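A minimal sketch of the manual deletion mentioned above, assuming the only unwanted entry is the top-level `__DATA_TYPES__` group. The function names `drop_jhdf5_group` and `clean_file` are hypothetical. The first helper works on any mapping-like object that supports `in` and `del`, which includes an `h5py.File` opened in a writable mode.

```python
JHDF5_GROUP = "__DATA_TYPES__"


def drop_jhdf5_group(root):
    """Delete the JHDF5 group from a writable root mapping if it is present.

    Returns True if the group was removed and False if it was not there.
    """
    if JHDF5_GROUP in root:
        del root[JHDF5_GROUP]
        return True
    return False


def clean_file(path):
    """Open an HDF5 file in read/write mode and remove the JHDF5 group."""
    import h5py  # imported lazily so the helper above has no dependencies

    with h5py.File(path, "r+") as f:
        return drop_jhdf5_group(f)
```

Calling `clean_file("test.h5ad")` after the Java program has run would remove the link to the group in place. Note that this modifies the file, so it only suits cases where changing the file is acceptable.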

Thanks for the prompt feedback!

This is the file generated by the Python MWE above after accessing it with JHDF5. The difference in the h5dump output before and after the access is shown below.

1c1
< HDF5 "before.h5ad" {
---
> HDF5 "after.h5ad" {
57a58,70
>    }
>    GROUP "__DATA_TYPES__" {
>       DATATYPE "Enum_Boolean" H5T_ENUM {
>          H5T_STD_I8LE;
>          "FALSE"            0;
>          "TRUE"             1;
>       };
>       DATATYPE "String_VariableLength" H5T_STRING {
>          STRSIZE H5T_VARIABLE;
>          STRPAD H5T_STR_NULLTERM;
>          CSET H5T_CSET_ASCII;
>          CTYPE H5T_C_S1;
>       };

This more or less matches how you reproduced the issue. Your solution is a great drop-in replacement for reading these kinds of files, thanks a lot! It would be great, though, if this could be the default behavior in future versions to facilitate collaboration across language boundaries. Does this being in the ad.experimental namespace mean that this is already planned for a future release?

I can reproduce with:

import anndata as ad, numpy as np, h5py

a = ad.AnnData(np.ones((5, 5)))
a.write("tmp.h5ad")

with h5py.File("tmp.h5ad", "r+") as f:
    f.create_group("__DATA_TYPES__")

ad.read_h5ad("tmp.h5ad")

You can get around this right now with:

with h5py.File("tmp.h5ad", "r") as f:
    result = ad.experimental.read_elem(f)