h5py: Parallel write fails with Can't decrement id ref count error

Writing with h5py under certain conditions fails with the following traceback (relative paths truncated):

 Traceback (most recent call last):
  File "h5py/_objects.pyx", line 54, in h5p
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/lustre/atlas2/ard116/proj-shared/Downloads/titan/h5py-2.5.0/h5py/_objects.c:2716)
  File ".../h5py-2.5.0-py3.4-linux-x86_64.egg/h5py/_hl/files.py", line 306, in __exit__
    self.close()
  File ".../h5py-2.5.0-py3.4-linux-x86_64.egg/h5py/_hl/files.py", line 288, in close
    h5i.dec_ref(id_)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/lustre/atlas2/ard116/proj-shared/Downloads/titan/h5py-2.5.0/h5py/_objects.c:2759)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/lustre/atlas2/ard116/proj-shared/Downloads/titan/h5py-2.5.0/h5py/_objects.c:2716)
  File "h5py/h5i.pyx", line 150, in h5py.h5i.dec_ref (/lustre/atlas2/ard116/proj-shared/Downloads/titan/h5py-2.5.0/h5py/h5i.c:2339)
RuntimeError: Can't decrement id ref count (Other i/o error , error stack:
adioi_gen_close(120): other i/o error input/output error)
Rank 72 [Thu Mar 31 13:06:17 2016] [c17-6c0s3n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 72
.../pytools-2016.1-py3.4.egg/pytools/prefork.py:74: UserWarning: Prefork server exiting upon
  warn("%s exiting upon apparent death of %s" % (who, partner))
_pmiu_daemon(SIGCHLD): [NID 12614] [c17-6c0s3n0] [Thu Mar 31 13:06:18 2016] PE RANK 72 exit signal Aborted
[NID 12614] 2016-03-31 13:06:18 Apid 10570897: initiated application termination

The conditions under which this was found to happen are listed below:

  1. h5py parallel write with MPIO (see the sketch after this list)
  2. MPI rank count is about 5000
  3. Huge file writes: each file is roughly 1.5 TB in size, with 5000 separate datasets
  4. Not every time; we have observed roughly one failure in every two or three writes
  5. The file is left unlocked and can be opened, but contains some junk/zero data
  6. Not reproducible with smaller files or fewer ranks
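
For reference, the failing write pattern looks roughly like the minimal sketch below (file name, shapes, and dataset layout are illustrative only; the real job writes ~5000 datasets totalling ~1.5 TB per file; assumes mpi4py and an MPI-enabled build of h5py/HDF5):

    from mpi4py import MPI
    import h5py
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nranks = comm.Get_size()

    # Collective open with the MPIO driver; every rank participates.
    with h5py.File("output.h5", "w", driver="mpio", comm=comm) as f:
        # Dataset creation is a collective (metadata) operation, so all
        # ranks execute it with the same arguments.
        dset = f.create_dataset("data", (nranks, 1024), dtype="f8")
        # Each rank then writes its own row independently.
        dset[rank, :] = np.random.random(1024)
    # The "Can't decrement id ref count" error above is raised from
    # File.close() (h5i.dec_ref) when the underlying MPI-IO close fails.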

It seems to be some race condition in decrementing the reference count. This bug may be related to https://github.com/h5py/h5py/issues/495, which gave a similar error, but that bug was fixed later.

About this issue

  • State: open
  • Created 8 years ago
  • Comments: 22 (2 by maintainers)

Most upvoted comments

Thanks y’all, using the server-local tmp (file:/tmp/) and then moving it to the cluster file system (dbfs:/) works well!

@PabloAMC There are two important error messages in your traceback — 'Software caused connection abort', 'Transport endpoint is not connected' — that indicate the problem is likely with the backend storage system and not h5py or libhdf5.

It seems there are two workarounds, as @danzafar suggests:

  • Use MLFlow

  • Create the model in a tmp folder and move it to the Databricks DBFS folder later.

@nareshr8, when working with Databricks, the FUSE mount that underlies DBFS has some major issues with the Keras save() function. If you want to use regular Keras, the workflow needed to save to DBFS is to first save to a local file, like /tmp/your_model.h5, and then move that over to DBFS using the %fs magic command, roughly as sketched below.
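
In notebook terms, that workaround is roughly the following sketch (paths and the destination mount are illustrative; model is assumed to be an existing Keras model, and dbutils is only available inside a Databricks notebook):

    # Save with Keras to driver-local storage first (illustrative path).
    model.save("/tmp/your_model.h5")

    # ...then copy the file onto DBFS. From Python this can be done with
    # dbutils.fs.cp; the %fs magic command is the equivalent:
    #   %fs cp file:/tmp/your_model.h5 dbfs:/mnt/models/your_model.h5
    dbutils.fs.cp("file:/tmp/your_model.h5", "dbfs:/mnt/models/your_model.h5")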

The other option is to simply use tf.keras instead of keras, or to use MLflow, and not use h5py at all. That will play nicer with the FUSE mount to DBFS.
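
For example, something along these lines (a sketch only; the toy model and destination path are made up, and save_format="tf" assumes TensorFlow 2.x):

    import tensorflow as tf

    # A tf.keras model saved in the TensorFlow SavedModel format does not go
    # through h5py at all (illustrative model and path).
    model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
    model.save("/dbfs/mnt/models/your_model", save_format="tf")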

I am seeing this issue using a multi-node cluster on Azure pushing to a mount. The code gives this error many times:

File "h5py/_objects.pyx", line 193, in h5py._objects.ObjectID.__dealloc__
RuntimeError: Can't decrement id ref count (file write failed: time = Fri Dec 20 14:10:16 2019
, filename = '/dbfs/mnt/datalake_read_write/.../keras_model_v1.h5', file descriptor = 13, errno = 95, error message = 'Operation not supported', buf = 0x55c025877580, total write size = 4, bytes this sub-write = 4, bytes actually written = 18446744073709551615, offset = 103696)

before failing with:

RuntimeError: Unable to flush file's cached information

Any thoughts @aragilar?