h5py: Parallel write fails with Can't decrement id ref count error

Writing with h5py under certain conditions fails with the following traceback (relative paths truncated):

 Traceback (most recent call last):
  File "h5py/_objects.pyx", line 54, in h5p
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/lustre/atlas2/ard116/proj-shared/Downloads/titan/h5py-2.5.0/h5py/_objects.c:2716)
  File ".../h5py-2.5.0-py3.4-linux-x86_64.egg/h5py/_hl/files.py", line 306, in __exit__
    self.close()
  File ".../h5py-2.5.0-py3.4-linux-x86_64.egg/h5py/_hl/files.py", line 288, in close
    h5i.dec_ref(id_)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/lustre/atlas2/ard116/proj-shared/Downloads/titan/h5py-2.5.0/h5py/_objects.c:2759)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/lustre/atlas2/ard116/proj-shared/Downloads/titan/h5py-2.5.0/h5py/_objects.c:2716)
  File "h5py/h5i.pyx", line 150, in h5py.h5i.dec_ref (/lustre/atlas2/ard116/proj-shared/Downloads/titan/h5py-2.5.0/h5py/h5i.c:2339)
RuntimeError: Can't decrement id ref count (Other i/o error , error stack:
adioi_gen_close(120): other i/o error input/output error)
Rank 72 [Thu Mar 31 13:06:17 2016] [c17-6c0s3n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 72
.../pytools-2016.1-py3.4.egg/pytools/prefork.py:74: UserWarning: Prefork server exiting upon
  warn("%s exiting upon apparent death of %s" % (who, partner))
_pmiu_daemon(SIGCHLD): [NID 12614] [c17-6c0s3n0] [Thu Mar 31 13:06:18 2016] PE RANK 72 exit signal Aborted
[NID 12614] 2016-03-31 13:06:18 Apid 10570897: initiated application termination

The conditions under which this was found to happen are listed below:

  1. h5py parallel write with MPIO (see the sketch after this list)
  2. MPI rank count is about 5000
  3. Huge file writes: each file is roughly 1.5 TB in size, with 5000 separate datasets
  4. Not every time; we have observed roughly one failure in every two or three writes
  5. The file is left unlocked and can be opened, but contains some junk/zero data
  6. Not reproducible with smaller files or fewer ranks
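
For reference, the failing write pattern looks roughly like the minimal sketch below (file name, shapes, and dataset layout are illustrative only; the real job writes ~5000 datasets totalling ~1.5 TB per file; assumes mpi4py and an MPI-enabled build of h5py/HDF5):

    from mpi4py import MPI
    import h5py
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nranks = comm.Get_size()

    # Collective open with the MPIO driver; every rank participates.
    with h5py.File("output.h5", "w", driver="mpio", comm=comm) as f:
        # Dataset creation is a collective (metadata) operation, so all
        # ranks execute it with the same arguments.
        dset = f.create_dataset("data", (nranks, 1024), dtype="f8")
        # Each rank then writes its own row independently.
        dset[rank, :] = np.random.random(1024)
    # The "Can't decrement id ref count" error above is raised from
    # File.close() (h5i.dec_ref) when the underlying MPI-IO close fails.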

It seems to be some race condition in decrementing the reference count. This bug may be related to https://github.com/h5py/h5py/issues/495, which gave a similar error, but that bug was fixed later.

About this issue

  • State: open
  • Created 8 years ago
  • Comments: 22 (2 by maintainers)

Most upvoted comments

Thanks y’all, using the server-local tmp (file:/tmp/) and then moving it to the cluster file system (dbfs:/) works well!

@PabloAMC There are two important error messages in your traceback — 'Software caused connection abort', 'Transport endpoint is not connected' — that indicate the problem is likely with the backend storage system and not h5py or libhdf5.

It seems there are two workarounds, as @danzafar suggests:

  • Use MLFlow

  • Create the model in a tmp folder and move it to the Databricks DBFS folder later.

@nareshr8, when working with Databricks, the FUSE mount that underlies DBFS has some major issues with the Keras save() function. If you want to use regular Keras, the workflow needed to save to DBFS is to first save to a local file, like /tmp/your_model.h5, and then move that over to DBFS using the %fs magic command, roughly as sketched below.
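
In notebook terms, that workaround is roughly the following sketch (paths and the destination mount are illustrative; model is assumed to be an existing Keras model, and dbutils is only available inside a Databricks notebook):

    # Save with Keras to driver-local storage first (illustrative path).
    model.save("/tmp/your_model.h5")

    # ...then copy the file onto DBFS. From Python this can be done with
    # dbutils.fs.cp; the %fs magic command is the equivalent:
    #   %fs cp file:/tmp/your_model.h5 dbfs:/mnt/models/your_model.h5
    dbutils.fs.cp("file:/tmp/your_model.h5", "dbfs:/mnt/models/your_model.h5")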

The other option is to simply use tf.keras instead of keras, or to use MLflow, and not use h5py at all. That will play nicer with the FUSE mount to DBFS.
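
For example, something along these lines (a sketch only; the toy model and destination path are made up, and save_format="tf" assumes TensorFlow 2.x):

    import tensorflow as tf

    # A tf.keras model saved in the TensorFlow SavedModel format does not go
    # through h5py at all (illustrative model and path).
    model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
    model.save("/dbfs/mnt/models/your_model", save_format="tf")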

I am seeing this issue using a multi-node cluster on Azure pushing to a mount. The code gives this error many times:

File "h5py/_objects.pyx", line 193, in h5py._objects.ObjectID.__dealloc__
RuntimeError: Can't decrement id ref count (file write failed: time = Fri Dec 20 14:10:16 2019
, filename = '/dbfs/mnt/datalake_read_write/.../keras_model_v1.h5', file descriptor = 13, errno = 95, error message = 'Operation not supported', buf = 0x55c025877580, total write size = 4, bytes this sub-write = 4, bytes actually written = 18446744073709551615, offset = 103696)

before failing with:

RuntimeError: Unable to flush file's cached information

Any thoughts @aragilar?