xarray: slow performance with open_mfdataset
We have a dataset stored across multiple netCDF files. We are getting very slow performance with open_mfdataset, and I would like to improve this.
Each individual netCDF file looks like this:
%time ds_single = xr.open_dataset('float_trajectories.0000000000.nc')
ds_single
CPU times: user 14.9 ms, sys: 48.4 ms, total: 63.4 ms
Wall time: 60.8 ms
<xarray.Dataset>
Dimensions: (npart: 8192000, time: 1)
Coordinates:
* time (time) datetime64[ns] 1993-01-01
* npart (npart) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
z (time, npart) float32 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
vort (time, npart) float32 -9.71733e-10 -9.72858e-10 -9.73001e-10 ...
u (time, npart) float32 0.000545563 0.000544884 0.000544204 ...
v (time, npart) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
x (time, npart) float32 180.016 180.047 180.078 180.109 180.141 ...
y (time, npart) float32 -79.9844 -79.9844 -79.9844 -79.9844 ...
As shown above, a single data file opens in ~60 ms.
When I call open_mfdataset on 49 files (each with a different time dimension but the same npart), here is what happens:
%time ds = xr.open_mfdataset('*.nc')
ds
CPU times: user 1min 31s, sys: 25.4 s, total: 1min 57s
Wall time: 2min 4s
<xarray.Dataset>
Dimensions: (npart: 8192000, time: 49)
Coordinates:
* npart (npart) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
* time (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
z (time, npart) float64 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
vort (time, npart) float64 -9.717e-10 -9.729e-10 -9.73e-10 -9.73e-10 ...
u (time, npart) float64 0.0005456 0.0005449 0.0005442 0.0005437 ...
v (time, npart) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
x (time, npart) float64 180.0 180.0 180.1 180.1 180.1 180.2 180.2 ...
y (time, npart) float64 -79.98 -79.98 -79.98 -79.98 -79.98 -79.98 ...
It takes over 2 minutes to open the dataset. Specifying concat_dim='time' does not improve performance. Here is %prun of the open_mfdataset command:
748994 function calls (724222 primitive calls) in 142.160 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
49 62.455 1.275 62.458 1.275 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
49 47.207 0.963 47.209 0.963 base.py:1067(is_unique)
196 7.198 0.037 7.267 0.037 {operator.getitem}
49 4.632 0.095 4.687 0.096 netCDF4_.py:182(_open_netcdf4_group)
240 3.189 0.013 3.426 0.014 numeric.py:2476(array_equal)
98 1.937 0.020 1.937 0.020 {numpy.core.multiarray.arange}
4175/3146 1.867 0.000 9.296 0.003 {numpy.core.multiarray.array}
49 1.525 0.031 119.144 2.432 alignment.py:251(reindex_variables)
24 1.065 0.044 1.065 0.044 {method 'cumsum' of 'numpy.ndarray' objects}
12 1.010 0.084 1.010 0.084 {method 'sort' of 'numpy.ndarray' objects}
5227/4035 0.660 0.000 1.688 0.000 collections.py:50(__init__)
12 0.600 0.050 3.238 0.270 core.py:2761(insert)
12691/7497 0.473 0.000 0.875 0.000 indexing.py:363(shape)
110728 0.425 0.000 0.663 0.000 {isinstance}
12 0.413 0.034 0.413 0.034 {method 'flatten' of 'numpy.ndarray' objects}
12 0.341 0.028 0.341 0.028 {numpy.core.multiarray.where}
2 0.333 0.166 0.333 0.166 {pandas._join.outer_join_indexer_int64}
1 0.331 0.331 142.164 142.164 <string>:1(<module>)
It looks like most of the time is being spent in reindex_variables. I understand why this happens: xarray needs to make sure the dimensions are the same in order to concatenate the files. Is there any obvious way I could improve the load time? For example, can I give xarray a hint that the reindex_variables step is not necessary, since I know that the npart dimension is the same in every file?
The general reason for this is usually that open_mfdataset performs coordinate compatibility checks when it concatenates the files. It's useful to actually read the code of open_mfdataset to see how it works.

First, all the files are opened individually: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L900-L903
You can recreate this step outside of xarray yourself by doing something like
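A minimal sketch of that step, assuming the files match *.nc:

from glob import glob
import xarray as xr

# Step 1 of open_mfdataset: open every file individually
paths = sorted(glob("*.nc"))
datasets = [xr.open_dataset(p, chunks={}) for p in paths]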
Once each dataset is open, xarray calls out to one of its combine functions. This logic has gotten more complex over the years as different options have been introduced, but the gist is this: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L947-L952
You can reproduce this step outside of xarray, e.g.
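A sketch, reusing the datasets list from the previous step:

# Step 2: hand the per-file datasets to a combine function, mirroring
# open_mfdataset's default combine="by_coords"
combined = xr.combine_by_coords(datasets)

# or, with an explicit file ordering along a known dimension:
# combined = xr.combine_nested(datasets, concat_dim="time")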
At that point, various checks will kick in to be sure that the coordinates in the different datasets are compatible. Performing these checks requires the data to be read eagerly, which can be a source of slow performance.
Without seeing more details about your files, it’s hard to know exactly where the issue lies. A good place to start is to simply drop all coordinates from your data as a preprocessing step.
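For example, something like this (a sketch using the preprocess hook):

import xarray as xr

# Drop every non-index coordinate from each file before combining, so
# that no coordinate compatibility check needs to read actual data
def drop_all_coords(ds):
    return ds.reset_coords(drop=True)

ds = xr.open_mfdataset("*.nc", preprocess=drop_all_coords)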
If you observe a big speedup, this points at coordinate compatibility checks as the culprit. From there you can experiment with the various options for open_mfdataset, such as coords='minimal', compat='override', etc. Once you post your file details, we can provide more concrete suggestions.
Wow, thank you!
This is an amazing bug. The defaults say data_vars="all", coords="different", which means: always concatenate all data_vars along the concat dimension (here inferred to be "time"), but only concatenate coords if they differ between files.

When decode_cf=False, lat and lon are data_vars and get concatenated without any checking or reading. When decode_cf=True, lat and lon are promoted to coords and then get checked for equality across all files. The two variables get read sequentially from every file. This is the slowdown you see. Once again, this is a consequence of bad defaults for concat and open_mfdataset.

I would follow https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets and use data_vars="minimal", coords="minimal", compat="override", which will only concatenate those variables with the time dimension, and skip any checking for variables that don't have a time dimension (it simply picks the variable from the first file).
OK, so it seems that we need a change to disable wrapping dask arrays with LazilyIndexedArray. Dask arrays are already lazy!

This issue is almost seven years old! It has been "fixed" many times since my original post, but people keep finding new ways to make it reappear. 😆
It seems like having better diagnostics / logging of what is happening under the hood with open_mfdataset is what people really need. Maybe even some sort of utility to pre-scan the files and figure out if they are easily openable or not.
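Nothing like that exists yet; a rough sketch of what such a pre-scan helper could look like (a hypothetical function, not an xarray API):

from glob import glob
import xarray as xr

# Report coordinates that differ from the first file, since mismatched
# coordinates are what trigger expensive equality checks in open_mfdataset
def prescan(pattern):
    paths = sorted(glob(pattern))
    with xr.open_dataset(paths[0]) as first:
        reference = {name: coord.load() for name, coord in first.coords.items()}
    for path in paths[1:]:
        with xr.open_dataset(path) as ds:
            for name, ref in reference.items():
                if name not in ds.coords:
                    print(f"{path}: missing coordinate {name!r}")
                elif not ref.equals(ds.coords[name]):
                    print(f"{path}: coordinate {name!r} differs")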
I've recently been trying to run open_mfdataset on a large list of large files. With more than ~100 files, the function became so slow that I gave up trying to run it. I then came upon this thread and discovered that if I passed the argument decode_cf=False, the function would run in a matter of seconds. Applying decode_cf to the returned dataset afterwards also ran in seconds, and I ended up with the same dataset from this two-step process as I did before.
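A sketch of that two-step process, assuming files matching *.nc:

import xarray as xr

# Step 1: skip CF decoding while combining; this avoids the expensive
# eager coordinate checks and runs in seconds
ds_raw = xr.open_mfdataset("*.nc", decode_cf=False)

# Step 2: decode the combined dataset once, after the fact
ds = xr.decode_cf(ds_raw)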
Would it be possible to change one of the following?
1. When decode_cf is called in open_mfdataset: open the individual datasets with decode_cf=False, then apply decode_cf to the merged dataset before it is returned.
2. Make decode_cf=False the default in open_mfdataset.
To me the first solution feels better, and I can make a pull request to do this.
From reading this thread I'm under the impression that there's probably something else going on under the hood that's causing the slowness of open_mfdataset at present. Obviously it would be best to address this; however, given that the problem was first raised in 2017 and a solution to the underlying problem doesn't seem to be forthcoming, I'd be very pleased to see a "fix" that addresses the symptom (the slowness) rather than the illness (whatever's going wrong behind the scenes).

An update on this long-standing issue.
I have learned that open_mfdataset can be blazingly fast with decode_cf=False but extremely slow with decode_cf=True.

As an example, I am loading a POP dataset on cheyenne; anyone with access can try it. This is roughly 45 years of daily data, one file per year.
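The fast call had roughly this form (a sketch; the path is hypothetical):

import xarray as xr

# Hypothetical glob for the POP daily output on cheyenne; the key
# detail is decode_cf=False
ds = xr.open_mfdataset("/glade/scratch/<user>/POP_daily/*.nc",
                       decode_cf=False)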
Instead, if I just change to decode_cf=True (the default), it takes forever, and I can watch what is happening on the distributed dashboard: there are more of these open_dataset tasks than there are files (45), so I can only presume there are 16401 individual tasks (one for each timestep), each taking about 1 s in serial.

This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372.
cc Pangeo folks: @jhamman, @mrocklin
@shoyer I will double check what the bottleneck is and report back.

@dcherian interestingly, the parallel option doesn't seem to do anything when decode_cf=True. From looking at the dask dashboard it seems to load each file sequentially, with each opening carried out by a different worker but not concurrently. I will see what I can do minimal-example-wise!

@rabernat wrote:
I seem to be experiencing a similar (same?) issue with open_dataset: https://stackoverflow.com/questions/71147712/can-i-force-xarray-open-dataset-to-do-a-lazy-load?stw=2
So is there any word on a best practice, fix, or workaround for the MFDataset performance? I am still getting abysmal reading performance with a list of netCDF files that represent sequential times. I want to use MFDataset to chunk multiple time steps into memory at once, but it's taking 5-10 minutes to construct MFDataset objects and even longer to run .values on them.