iris: Reading netcdf files is slow if there are unlimited dimensions

Reading arrays from NetCDF files is slow if one dimension is unlimited.

An example: reading an array of shape (5000, 50) takes ~7 s if the first dimension is unlimited, but only ~0.02 s if both dimensions are fixed. This is a major bottleneck when many (hundreds) of such files need to be processed, and the time dimension is often declared unlimited in files generated by circulation models.

Test case:

import iris
import time

f = 'example_dataset.nc'
var = 'sea_water_practical_salinity'

tic = time.process_time()
cube = iris.load_cube(f, var)
cube.data  # realise the lazy data, triggering the actual file reads
duration = time.process_time() - tic
print('Duration {:.3f} s'.format(duration))

The input NetCDF file can be generated with:

import iris
import numpy
import datetime

ntime = 5000
nz = 50
dt = 600.
time = numpy.arange(ntime, dtype=float)*dt
date_zero = datetime.datetime(2000, 1, 1)
date_epoch = datetime.datetime.utcfromtimestamp(0)
time_epoch = time + (date_zero - date_epoch).total_seconds()
z = numpy.linspace(0, 10, nz)
values = 5*numpy.sin(time/(14*24*3600.))
values = numpy.tile(values, (nz, 1)).T

time_dim = iris.coords.DimCoord(time_epoch, standard_name='time',
                                units='seconds since 1970-01-01 00:00:00-00')
z_dim = iris.coords.DimCoord(z, standard_name='depth', units='m')
cube = iris.cube.Cube(values)
cube.standard_name = 'sea_water_practical_salinity'
cube.units = '1'
cube.add_dim_coord(time_dim, 0)
cube.add_dim_coord(z_dim, 1)
iris.fileformats.netcdf.save(cube, 'example_dataset.nc',
                             unlimited_dimensions=['time'])

Profiling suggests that in the unlimited case each time slice is read separately, i.e. NetCDFDataProxy.__getitem__ is called 5000 times.
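The cost of that access pattern can be illustrated with a toy stand-in for the data proxy (CountingProxy is a hypothetical class for illustration, not iris code): each chunked read triggers a separate __getitem__ call, which in iris opens and closes the netCDF file.

```python
class CountingProxy:
    def __init__(self, shape):
        self.shape = shape
        self.calls = 0

    def __getitem__(self, keys):
        # In iris, each call here opens the netCDF file, reads the
        # requested slice, and closes the file again.
        self.calls += 1
        return 0.0  # stand-in for the slice data

proxy = CountingProxy((5000, 50))

# Per-time-slice chunking: one __getitem__ call per row.
for i in range(proxy.shape[0]):
    proxy[i, :]
print(proxy.calls)  # 5000

# Whole-array chunking: a single call.
proxy.calls = 0
proxy[:, :]
print(proxy.calls)  # 1
```

With per-slice chunking the fixed per-call overhead dominates, which matches the ~7 s vs ~0.02 s timings reported above.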

Tested with: iris version 2.2.0, Anaconda3 2019.03

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 32 (20 by maintainers)

Most upvoted comments

Thanks @bjlittle for looking into this. I confirm that the performance hit is caused by chunking. Setting chunks=None in netcdf._get_cf_var_data fixes the issue.

Modifying the input file chunking is not an option for me, as I get these files from an external model and I do not wish to convert them. Iris should be able to read any netCDF file with reasonable performance.

Having the chunk options in the iris load_cube API would be useful. This would fix my issue.

Evidence for #3361 fixing the original problem here: see this

I wish it would divide earlier dimensions preferentially over later ones (like the existing Iris code). Instead, it aims to divide all dimensions equally.

I think dask.array.core.normalize_chunks is intended as a general utility, not just for chunking files or serialised formats (most significantly, it is also used in rechunking dask arrays). I think one could easily argue for several different behaviours:

  • maintain original chunk proportions - sensible if you expect the access patterns that led to the current array chunks to continue in the future and you just want to reduce their fragmentation below a certain threshold
  • try to keep the chunks roughly square - sensible if access patterns along different dimensions are unpredictable
  • try to keep inner dimensions contiguous in preference to outer ones - sensible if you want to optimise for sequential reading, like from disk or tape

I don’t think there is a perfect choice here, it all depends on the use case.
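The difference between the outer-first and equal-division strategies can be sketched with two toy splitting functions (outer_first and equal_division are illustrative names, not dask code), targeting chunks of at most 100 000 values for a (5000, 50) array:

```python
from math import prod

def outer_first(shape, target):
    # Split only the leading dimension, keeping inner dimensions whole
    # (optimises for sequential on-disk reads).
    inner = prod(shape[1:])
    return (max(1, target // inner),) + tuple(shape[1:])

def equal_division(shape, target):
    # Shrink every dimension by the same factor (roughly "square" chunks).
    factor = (target / prod(shape)) ** (1 / len(shape))
    return tuple(max(1, int(d * factor)) for d in shape)

print(outer_first((5000, 50), 100_000))     # (2000, 50)
print(equal_division((5000, 50), 100_000))  # (3162, 31)
```

Outer-first keeps each inner row contiguous on disk; equal division fragments both dimensions but keeps the chunks closer to square.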

I might feedback a proposal to dask on this, if I can see a clear purpose for the ‘better’ way.

Perhaps dask could have an option like, say, chunking_strategy='proportional' or 'equal' or 'serialized' or a custom function? That way a one-size-fits-all solution would not be required.

But based on dask.array.from_array docs, dask already respects chunk boundaries if the underlying object has chunks attribute. Iris could add this to its NetCDFDataProxy object and get the right behaviour for free.

Here’s how this looks in practice:

import dask.array as da

class PseudoArray:
    def __init__(self, shape, dtype, chunks):
        self.shape = shape
        self.dtype = dtype
        self.chunks = chunks

shape = (120, 12, 1000, 1000)
netcdf_chunks = (16, 2, 32, 32)
a_proxy = PseudoArray(shape, dtype='float64', chunks=netcdf_chunks)
result = da.from_array(a_proxy, chunks='50 MiB')
print('netcdf chunks:', da.core.normalize_chunks(a_proxy.chunks, a_proxy.shape, dtype=a_proxy.dtype))
print('dask chunks:', result.chunks)

Output:

netcdf chunks: ((16, 16, 16, 16, 16, 16, 16, 8), (2, 2, 2, 2, 2, 2), (32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 8), (32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 8))
dask chunks: ((48, 48, 24), (6, 6), (96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 40), (96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 40))

Edit:

print('dask default chunks (128 MiB):', da.from_array(a_proxy).chunks)

Output: dask default chunks (128 MiB): ((64, 56), (8, 4), (128, 128, 128, 128, 128, 128, 128, 104), (128, 128, 128, 128, 128, 128, 128, 104))

Edit2:

And this is what happens if no chunks attribute is exposed (note how this trashes the original chunk boundaries):

print('dask bad chunks:', da.from_array(PseudoArray(shape, 'float64', None)).chunks)

Output: dask bad chunks: ((60, 60), (12,), (100, 100, 100, 100, 100, 100, 100, 100, 100, 100), (100, 100, 100, 100, 100, 100, 100, 100, 100, 100))

@TomekTrzeciak personally I would vote to do the dumb thing … load each file in one chunk

Unfortunately loading a variable as a single chunk is the one thing we must avoid, if the chunk is simply unmanageably large. That is the one thing that the current strategy was designed to solve.

If we currently decide that a chunk is too large and divide it up, I think in future we can reasonably decide that a chunk is too small and combine a few. The logic is essentially the same. However, the undesirable side effect of unlimited dimensions, namely that they mimic user-selected variable chunking, is a bit of a shock. Maybe Dask itself has already engineered ways around that problem. I’m not sure yet how much we can handily delegate these decisions to dask.

@bjlittle @pp-mo I do agree that the performance hit is partially in iris itself, and related to the opening/closing of the netCDF file. This happens in NetCDFDataProxy.__getitem__ which in my worst-case example is called 5000 times. As demonstrated, the chunking choice does affect the reading pattern though.

@tkarna I think that there should be a happy compromise between:

  1. iris honouring the chunking specified in the netcdf file
  2. iris overriding the chunking specified in the netcdf file that is considered too small and will result in a performance hit
  3. allowing the user to override iris by specifying the chunks at the API level

iris already performs step 1. above, as we know. Extending iris to support steps 2. and 3. should cover most other use cases, and give enough wiggle room for users to circumvent most chunking issues. Like you said, you’re a third-party to the data and can’t change the native netcdf chunking of variables in the file, so iris needs to help you easily work around that.

@bjlittle, personally I would vote to do the dumb thing here and either load each file in one chunk or rely on dask to make the chunking choice unless overridden by an explicit option. This leads to much more predictable behaviour in the long run. You can then allow users to explicitly set chunks=... in iris on per call basis or with context manager or globally.
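As a sketch only (this is not the iris API; all names here are hypothetical), a global chunking default with a context-manager override could look like:

```python
import contextlib

_DEFAULT_CHUNKS = 'auto'  # hypothetical global default

@contextlib.contextmanager
def chunk_options(chunks):
    """Temporarily override the default chunking used when loading."""
    global _DEFAULT_CHUNKS
    previous = _DEFAULT_CHUNKS
    _DEFAULT_CHUNKS = chunks
    try:
        yield
    finally:
        # Restore the previous default even if loading raised.
        _DEFAULT_CHUNKS = previous

def current_chunks():
    return _DEFAULT_CHUNKS

with chunk_options(None):  # e.g. load each variable as a single chunk
    print(current_chunks())  # None
print(current_chunks())  # auto
```

A loader would consult current_chunks() when building its dask arrays, so the override applies per call, per context block, or globally.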

Also, I think it’s preferable to address any deficiencies in chunking policy upstream in dask, so that iris doesn’t have to interfere too much. But based on dask.array.from_array docs, dask already respects chunk boundaries if the underlying object has chunks attribute. Iris could add this to its NetCDFDataProxy object and get the right behaviour for free.

@bjlittle Yes, that sounds very good indeed.

Another option, perhaps, would be to define a better default chunking scheme. For example, if a file chunk contains fewer than N values, one would revert to chunks=None, as (I’d imagine) reading in tiny chunks is never a good idea. This would also be easier for the user, as one wouldn’t need to tweak the chunk options.
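A minimal sketch of that heuristic (effective_chunks and the threshold of 100 000 values are assumptions for illustration, not iris code):

```python
from math import prod

def effective_chunks(file_chunks, min_chunk_values=100_000):
    """Return the file chunking, or None when each chunk is too small."""
    if file_chunks is None:
        return None
    if prod(file_chunks) < min_chunk_values:
        return None  # tiny file chunks: fall back to reading in one go
    return file_chunks

print(effective_chunks((1, 50)))       # None: 50 values per chunk is tiny
print(effective_chunks((1000, 1000)))  # (1000, 1000)
```

With chunks of one time slice, as produced by an unlimited dimension, this would ignore the file chunking and let the whole variable be read in a single operation.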

For what it’s worth I see the same performance with version 2.2.1 and the master branch as well.

I’m using version 2.2.0 installed with conda.