xarray: Quadratic slowdown when saving multiple datasets to the same h5 file (h5netcdf)

I can’t see anything wrong on my side of the code, so I’m wondering whether this kind of slowdown is expected or not?

Basically, what I’m doing is something like this:

with h5py.File('file.h5', 'w'):
    pass  # opening in 'w' mode truncates the file; nothing else needed

for i, ds in enumerate(datasets):
    ds.to_netcdf('file.h5', group=str(i), engine='h5netcdf', mode='a')
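For comparison, writing the same kind of data with plain h5py through a single open file handle stays linear in the number of groups, since nothing already written is revisited on each append. This is a minimal sketch with made-up stand-in arrays (the filename `file_raw.h5` and the data shapes are just for illustration):

```python
import numpy as np
import h5py

# Hypothetical stand-in data: 20 small arrays, one per group.
datasets = [np.random.default_rng(i).integers(-100, 100, (1000, 3))
            for i in range(20)]

# One open handle, one group per dataset; h5py creates the
# intermediate groups ('0', '1', ...) automatically.
with h5py.File('file_raw.h5', 'w') as f:
    for i, arr in enumerate(datasets):
        f.create_dataset(f'{i}/arr', data=arr)

with h5py.File('file_raw.h5', 'r') as f:
    print(len(f.keys()))  # 20 groups at the root
```

Of course this drops the netCDF dimension-scale metadata, which is exactly the part that h5netcdf reprocesses on every append.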

And here’s the log for saving 20 datasets; each line shows the time to save one dataset. Instead of the expected ~10 seconds total (which is already kind of slow, but whatever), it takes over 2 minutes. The time to save each dataset grows linearly, which makes the overall runtime quadratic:

saving dataset... 00:00:00.559135
saving dataset... 00:00:00.924617
saving dataset... 00:00:01.351670
saving dataset... 00:00:01.818111
saving dataset... 00:00:02.356307
saving dataset... 00:00:02.971077
saving dataset... 00:00:03.685565
saving dataset... 00:00:04.375104
saving dataset... 00:00:04.575837
saving dataset... 00:00:05.179975
saving dataset... 00:00:05.793876
saving dataset... 00:00:06.517916
saving dataset... 00:00:07.190257
saving dataset... 00:00:07.993795
saving dataset... 00:00:08.786421
saving dataset... 00:00:09.414821
saving dataset... 00:00:10.729006
saving dataset... 00:00:11.584044
saving dataset... 00:00:14.160655
saving dataset... 00:00:14.460564

CPU times: user 1min 49s, sys: 12.8 s, total: 2min 2s
Wall time: 2min 4s
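The numbers are consistent with quadratic behaviour: if the i-th save costs roughly c·i, the total over n saves is c·n(n+1)/2. A quick sanity check (the ~0.7 s per-save slope is eyeballed from the log above, not exact):

```python
# If each save costs a little more than the previous one (linear growth),
# the total is an arithmetic series, i.e. quadratic in the number of saves.
n = 20
slope = 0.7  # seconds added per save, eyeballed from the log above

per_save = [slope * (i + 1) for i in range(n)]
total = sum(per_save)  # == slope * n * (n + 1) / 2
print(round(total))    # ~147 s, in the ballpark of the observed 2 minutes
assert abs(total - slope * n * (n + 1) / 2) < 1e-9
```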

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 24 (14 by maintainers)

Most upvoted comments

Well done, Kai!

FYI: h5netcdf has just merged a refactor of the dimension scale handling, which greatly improves the performance here. It will be released in the next version (0.13.0).

See https://github.com/h5netcdf/h5netcdf/pull/112

I’ll come back if the release is out, so we can close this issue.

I suspect this could be solved by adding an optimization to h5netcdf so that _attach_dim_scales() (and maybe some other methods) is only called on variables/groups that have been modified, instead of on the entire file.

It’s probably worth moving the discussion over into the h5netcdf tracker, anyways 😃

Here’s the minimal example, try running this:

import time
import xarray as xr
import numpy as np
import h5py

arr = xr.DataArray(np.random.RandomState(0).randint(-100, 100, (50_000, 3)), dims=['x', 'y'])
ds = xr.Dataset({'arr': arr})

filename = 'test.h5'
def save(group):
    ds.to_netcdf(filename, engine='h5netcdf', mode='a', group=str(group))

with h5py.File(filename, 'w'):
    pass  # create (or truncate) the file so mode='a' can append groups

for i in range(250):
    t0 = time.time()
    save(i)
    print(time.time() - t0)
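To quantify the growth rather than eyeball it, you can fit a line to the per-save timings. This sketch uses a hypothetical hard-coded `times` list so it runs standalone; in the repro above you would instead append `time.time() - t0` to the list inside the loop:

```python
import numpy as np

# Hypothetical per-save timings in seconds (shaped like the log above);
# collect real ones by appending time.time() - t0 inside the save loop.
times = [0.56, 0.92, 1.35, 1.82, 2.36, 2.97, 3.69, 4.38]

# Fit t_i ~ a*i + b. A clearly positive slope a means each save gets
# more expensive as groups accumulate, i.e. quadratic total cost.
a, b = np.polyfit(np.arange(len(times)), times, deg=1)
print(f"slope per save: {a:.3f} s")
assert a > 0.1  # per-save cost is growing, not constant
```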