iris: Reading netcdf files is slow if there are unlimited dimensions
Reading arrays from NetCDF files is slow if one dimension is unlimited.
An example: reading an array of shape (5000, 50)
takes ~7 s if the first dimension is unlimited. If both dimensions are fixed, it takes ~0.02 s. This is a major bottleneck if many (hundreds) of such files need to be processed. The time dimension is often declared unlimited in files generated by circulation models.
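For context, the on-disk layout can be inspected with the netCDF4 package. A minimal sketch (not part of the original report), assuming the file and variable names used in the test case below:

import netCDF4

f = 'example_dataset.nc'
with netCDF4.Dataset(f) as ds:
    for name, dim in ds.dimensions.items():
        print(name, len(dim), 'unlimited' if dim.isunlimited() else 'fixed')
    var = ds.variables['sea_water_practical_salinity']
    # .chunking() returns 'contiguous' or a list of per-dimension chunk sizes
    print('storage:', var.chunking())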
Test case:
import iris
import time
f = 'example_dataset.nc'
var = 'sea_water_practical_salinity'
tic = time.process_time()
cube = iris.load_cube(f, var)
cube.data  # accessing .data realises the lazy array
duration = time.process_time() - tic
print('Duration {:.3f} s'.format(duration))
The input NetCDF file can be generated with:
import iris
import numpy
import datetime
ntime = 5000
nz = 50
dt = 600.
time = numpy.arange(ntime, dtype=float)*dt
date_zero = datetime.datetime(2000, 1, 1)
date_epoch = datetime.datetime.utcfromtimestamp(0)
time_epoch = time + (date_zero - date_epoch).total_seconds()
z = numpy.linspace(0, 10, nz)
values = 5*numpy.sin(time/(14*24*3600.))
values = numpy.tile(values, (nz, 1)).T
time_dim = iris.coords.DimCoord(time_epoch, standard_name='time',
                                units='seconds since 1970-01-01 00:00:00-00')
z_dim = iris.coords.DimCoord(z, standard_name='depth', units='m')
cube = iris.cube.Cube(values)
cube.standard_name = 'sea_water_practical_salinity'
cube.units = '1'
cube.add_dim_coord(time_dim, 0)
cube.add_dim_coord(z_dim, 1)
iris.fileformats.netcdf.save(cube, 'example_dataset.nc',
                             unlimited_dimensions=['time'])
Profiling suggests that in the unlimited case each time slice is being read separately, i.e. NetCDFDataProxy.__getitem__ is being called 5000 times.
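A rough way to confirm the call count with cProfile (a sketch, not the exact profiling run referred to above):

import cProfile
import pstats
import iris

profiler = cProfile.Profile()
profiler.enable()
cube = iris.load_cube('example_dataset.nc', 'sea_water_practical_salinity')
cube.data  # realise the lazy data
profiler.disable()
# Restrict the report to __getitem__ calls to see the per-slice reads
pstats.Stats(profiler).sort_stats('cumulative').print_stats('__getitem__')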
Tested with: iris version 2.2.0, Anaconda3 2019.03
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 32 (20 by maintainers)
Thanks @bjlittle for looking into this. I confirm that the performance hit is caused by chunking. Setting chunks=None in netcdf._get_cf_var_data fixes the issue.

Modifying the input file chunking is not an option for me, as I get these files from an external model and I do not wish to convert them. Iris should be able to read any netCDF file with reasonable performance.
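For anyone needing a stop-gap without patching iris internals, one blunt workaround is to read the variable eagerly with netCDF4 and attach it to the cube. A sketch, assuming the netCDF variable in the file is named after its standard_name:

import iris
import netCDF4

f = 'example_dataset.nc'
var = 'sea_water_practical_salinity'
cube = iris.load_cube(f, var)
# Replace the lazy data with a single bulk read through the netCDF library,
# bypassing the iris/dask chunking choice entirely
with netCDF4.Dataset(f) as ds:
    cube.data = ds.variables[var][:]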
Having the chunk options in the iris load_cube API would be useful. This would fix my issue.

Evidence for #3361 fixing the original problem here: see this
I think dask.array.core.normalize_chunks is intended as a general utility, not just for chunking files or serialised formats (most significantly, it is also used in rechunking dask arrays). I think one could easily argue for several different behaviours. I don't think there is a perfect choice here; it all depends on the use case.
Perhaps dask could have an option like, say, chunking_strategy='proportional' or 'equal' or 'serialized' or custom_function? That way a one-size-fits-all solution would not be required.

Here's how this looks in practice:
Output: netcdf chunks: ((16, 16, 16, 16, 16, 16, 16, 8), (2, 2, 2, 2, 2, 2), (32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 8), (32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 8)) dask chunks: ((48, 48, 24), (6, 6), (96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 40), (96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 40))
Edit:
Output: dask default chunks (128 MiB): ((64, 56), (8, 4), (128, 128, 128, 128, 128, 128, 128, 104), (128, 128, 128, 128, 128, 128, 128, 104))
Edit 2:
And this is what happens if there are no chunks exposed (note how this will trash the original chunk boundaries):
Output: dask bad chunks: ((60, 60), (12,), (100, 100, 100, 100, 100, 100, 100, 100, 100, 100), (100, 100, 100, 100, 100, 100, 100, 100, 100, 100))
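For illustration, this kind of comparison can be reproduced directly with dask.array.core.normalize_chunks. The shape and file chunking below are inferred from the printed output above; the exact result depends on the dask version and the byte limit used:

import numpy
from dask.array.core import normalize_chunks

shape = (120, 12, 1000, 1000)
file_chunks = normalize_chunks((16, 2, 32, 32), shape=shape)
# 'auto' chunking that respects the on-disk chunk boundaries
aligned = normalize_chunks('auto', shape=shape, dtype=numpy.float64,
                           limit=128 * 2**20, previous_chunks=file_chunks)
# 'auto' chunking with no knowledge of the file layout
free = normalize_chunks('auto', shape=shape, dtype=numpy.float64,
                        limit=128 * 2**20)
print('netcdf chunks:', file_chunks)
print('aligned dask chunks:', aligned)
print('unaligned dask chunks:', free)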
Unfortunately, loading a variable as a single chunk is the one thing we must avoid if the chunk is simply unmanageably large. That is precisely the problem the current strategy was designed to solve.
If we currently decide that a chunk is too large and divide it up, I think in future we can reasonably decide that a chunk is too small and combine a few. The logic is essentially the same. However, the undesirable effect of unlimited dimensions, in that it mimics user-selected variable chunking, is a bit of a shock. Maybe Dask itself has already engineered ways around that problem. I'm not sure yet how much we can handily delegate decisions to dask.
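The "combine a few" idea could look roughly like the following. This is purely illustrative, not iris code:

def merge_small_chunks(file_chunks, shape, itemsize, target_bytes=2**20):
    # Grow the chunk length along the outermost dimension, in whole multiples
    # of the file chunk, until each dask chunk holds at least target_bytes.
    inner_bytes = itemsize
    for n in shape[1:]:
        inner_bytes *= n
    step = file_chunks[0]
    while step * inner_bytes < target_bytes and step < shape[0]:
        step += file_chunks[0]
    return (min(step, shape[0]),) + tuple(shape[1:])

# The (5000, 50) float64 case with (1, 50) file chunks:
print(merge_small_chunks((1, 50), (5000, 50), 8))  # -> (2622, 50)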
@bjlittle @pp-mo I do agree that the performance hit is partially in iris itself, and related to the opening/closing of the netCDF file. This happens in NetCDFDataProxy.__getitem__, which in my worst-case example is called 5000 times. As demonstrated, the chunking choice does affect the reading pattern though.

@bjlittle, personally I would vote to do the dumb thing here and either load each file in one chunk or rely on dask to make the chunking choice unless overridden by an explicit option. This leads to much more predictable behaviour in the long run. You can then allow users to explicitly set chunks=... in iris on a per-call basis, with a context manager, or globally.

Also, I think it's preferable to address any deficiencies in chunking policy upstream in dask, so that iris doesn't have to interfere too much. But based on the dask.array.from_array docs, dask already respects chunk boundaries if the underlying object has a chunks attribute. Iris could add this to its NetCDFDataProxy object and get the right behaviour for free.

@bjlittle Yes, that sounds very good indeed.
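As a sketch of the "chunks attribute" idea mentioned above: a hypothetical array-like wrapper (not iris's actual NetCDFDataProxy) that advertises the file chunking, handed to dask.array.from_array with chunks='auto', which can pick up the chunks attribute when choosing block sizes:

import dask.array as da
import netCDF4

class ChunkedProxy:
    # Hypothetical stand-in for iris's NetCDFDataProxy, exposing .chunks
    def __init__(self, path, varname):
        self.path = path
        self.varname = varname
        with netCDF4.Dataset(path) as ds:
            v = ds.variables[varname]
            self.shape = v.shape
            self.dtype = v.dtype
            self.ndim = v.ndim
            chunking = v.chunking()  # 'contiguous' or per-dimension sizes
            self.chunks = tuple(chunking) if isinstance(chunking, list) else v.shape

    def __getitem__(self, keys):
        # Open/read/close per access, as the proxy does
        with netCDF4.Dataset(self.path) as ds:
            return ds.variables[self.varname][keys]

proxy = ChunkedProxy('example_dataset.nc', 'sea_water_practical_salinity')
lazy = da.from_array(proxy, chunks='auto')  # 'auto' can use proxy.chunks
print(lazy.chunks)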
@tkarna I think that there should be a happy compromise between:

1. iris honouring the chunking specified in the netcdf file
2. iris overriding the chunking specified in the netcdf file that is considered too small and will result in a performance hit
3. overriding iris by specifying the chunks at the API level

iris already performs step 1. above, as we know. Extending iris to support steps 2. and 3. should cover most other use cases, and give enough wiggle room for users to circumvent most chunking issues. Like you said, you're a third-party to the data and can't change the native netcdf chunking of variables in the file, so iris needs to help you easily work around that.

Another option, perhaps, would be to define a better default chunking scheme. For example, if the file chunk only contains < N values, one would revert to chunks=None, as (I'd imagine) reading in tiny chunks is never a good idea. This would also be easier for the user, as one wouldn't need to tweak the chunk options.
For what it's worth, I see the same performance with version 2.2.1 and the master branch as well.

I'm using version 2.2.0 installed with conda.