xarray: slow performance with open_mfdataset
We have a dataset stored across multiple netCDF files. We are getting very slow performance with open_mfdataset, and I would like to improve this.
Each individual netCDF file looks like this:
%time ds_single = xr.open_dataset('float_trajectories.0000000000.nc')
ds_single
CPU times: user 14.9 ms, sys: 48.4 ms, total: 63.4 ms
Wall time: 60.8 ms
<xarray.Dataset>
Dimensions: (npart: 8192000, time: 1)
Coordinates:
* time (time) datetime64[ns] 1993-01-01
* npart (npart) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
z (time, npart) float32 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
vort (time, npart) float32 -9.71733e-10 -9.72858e-10 -9.73001e-10 ...
u (time, npart) float32 0.000545563 0.000544884 0.000544204 ...
v (time, npart) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
x (time, npart) float32 180.016 180.047 180.078 180.109 180.141 ...
y (time, npart) float32 -79.9844 -79.9844 -79.9844 -79.9844 ...
As shown above, a single data file opens in ~60 ms.
When I call open_mfdataset on 49 files (each with a different time dimension but the same npart), here is what happens:
%time ds = xr.open_mfdataset('*.nc')
ds
CPU times: user 1min 31s, sys: 25.4 s, total: 1min 57s
Wall time: 2min 4s
<xarray.Dataset>
Dimensions: (npart: 8192000, time: 49)
Coordinates:
* npart (npart) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
* time (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
z (time, npart) float64 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
vort (time, npart) float64 -9.717e-10 -9.729e-10 -9.73e-10 -9.73e-10 ...
u (time, npart) float64 0.0005456 0.0005449 0.0005442 0.0005437 ...
v (time, npart) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
x (time, npart) float64 180.0 180.0 180.1 180.1 180.1 180.2 180.2 ...
y (time, npart) float64 -79.98 -79.98 -79.98 -79.98 -79.98 -79.98 ...
It takes over 2 minutes to open the dataset. Specifying concat_dim='time' does not improve performance. Here is %prun of the open_mfdataset command:
748994 function calls (724222 primitive calls) in 142.160 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
49 62.455 1.275 62.458 1.275 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
49 47.207 0.963 47.209 0.963 base.py:1067(is_unique)
196 7.198 0.037 7.267 0.037 {operator.getitem}
49 4.632 0.095 4.687 0.096 netCDF4_.py:182(_open_netcdf4_group)
240 3.189 0.013 3.426 0.014 numeric.py:2476(array_equal)
98 1.937 0.020 1.937 0.020 {numpy.core.multiarray.arange}
4175/3146 1.867 0.000 9.296 0.003 {numpy.core.multiarray.array}
49 1.525 0.031 119.144 2.432 alignment.py:251(reindex_variables)
24 1.065 0.044 1.065 0.044 {method 'cumsum' of 'numpy.ndarray' objects}
12 1.010 0.084 1.010 0.084 {method 'sort' of 'numpy.ndarray' objects}
5227/4035 0.660 0.000 1.688 0.000 collections.py:50(__init__)
12 0.600 0.050 3.238 0.270 core.py:2761(insert)
12691/7497 0.473 0.000 0.875 0.000 indexing.py:363(shape)
110728 0.425 0.000 0.663 0.000 {isinstance}
12 0.413 0.034 0.413 0.034 {method 'flatten' of 'numpy.ndarray' objects}
12 0.341 0.028 0.341 0.028 {numpy.core.multiarray.where}
2 0.333 0.166 0.333 0.166 {pandas._join.outer_join_indexer_int64}
1 0.331 0.331 142.164 142.164 <string>:1(<module>)
It looks like most of the time is being spent in reindex_variables. I understand why this happens: xarray needs to make sure the dimensions are the same in order to concatenate the files. Is there any obvious way I could improve the load time? For example, can I give xarray a hint that the reindex_variables step is not necessary, since I know that the npart dimension is the same in every file?
The general reason for this is usually that open_mfdataset performs coordinate compatibility checks when it concatenates the files. It's useful to actually read the code of open_mfdataset to see how it works.

First, all the files are opened individually: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L900-L903
You can recreate this step outside of xarray yourself by doing something like
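A minimal sketch of that step, assuming the files match *.nc:

from glob import glob
import xarray as xr

# Step 1 of open_mfdataset: open every file individually
paths = sorted(glob("*.nc"))
datasets = [xr.open_dataset(p, chunks={}) for p in paths]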
Once each dataset is open, xarray calls out to one of its combine functions. This logic has gotten more complex over the years as different options have been introduced, but the gist is this: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L947-L952
You can reproduce this step outside of xarray, e.g.
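A sketch, reusing the datasets list from the previous step:

# Step 2: hand the per-file datasets to a combine function, mirroring
# open_mfdataset's default combine="by_coords"
combined = xr.combine_by_coords(datasets)

# or, with an explicit file ordering along a known dimension:
# combined = xr.combine_nested(datasets, concat_dim="time")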
At that point, various checks will kick in to be sure that the coordinates in the different datasets are compatible. Performing these checks requires the data to be read eagerly, which can be a source of slow performance.
Without seeing more details about your files, it’s hard to know exactly where the issue lies. A good place to start is to simply drop all coordinates from your data as a preprocessing step.
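For example, something like this (a sketch using the preprocess hook):

import xarray as xr

# Drop every non-index coordinate from each file before combining, so
# that no coordinate compatibility check needs to read actual data
def drop_all_coords(ds):
    return ds.reset_coords(drop=True)

ds = xr.open_mfdataset("*.nc", preprocess=drop_all_coords)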
If you observe a big speedup, this points at coordinate compatibility checks as the culprit. From there you can experiment with the various options for open_mfdataset, such as coords='minimal', compat='override', etc. Once you post your file details, we can provide more concrete suggestions.
Wow, thank you!
This is an amazing bug. The defaults say data_vars="all", coords="different", which means: always concatenate all data_vars along the concat dimension (here inferred to be "time"), but only concatenate coords if they differ between files.

When decode_cf=False, lat and lon are data_vars and get concatenated without any checking or reading. When decode_cf=True, lat and lon are promoted to coords and then get checked for equality across all files. The two variables get read sequentially from every file. This is the slowdown you see. Once again, this is a consequence of bad defaults for concat and open_mfdataset.

I would follow https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets and use data_vars="minimal", coords="minimal", compat="override", which will only concatenate those variables with the time dimension, and skip any checking for variables that don't have a time dimension (it simply picks the variable from the first file).
OK, so it seems that we need a change to disable wrapping dask arrays with LazilyIndexedArray. Dask arrays are already lazy!

This issue is almost seven years old! It has been "fixed" many times since my original post, but people keep finding new ways to make it reappear. 😆
It seems like having better diagnostics / logging of what is happening under the hood with open_mfdataset is what people really need. Maybe even some sort of utility to pre-scan the files and figure out if they are easily openable or not.
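Nothing like that exists yet; a rough sketch of what such a pre-scan helper could look like (a hypothetical function, not an xarray API):

from glob import glob
import xarray as xr

# Report coordinates that differ from the first file, since mismatched
# coordinates are what trigger expensive equality checks in open_mfdataset
def prescan(pattern):
    paths = sorted(glob(pattern))
    with xr.open_dataset(paths[0]) as first:
        reference = {name: coord.load() for name, coord in first.coords.items()}
    for path in paths[1:]:
        with xr.open_dataset(path) as ds:
            for name, ref in reference.items():
                if name not in ds.coords:
                    print(f"{path}: missing coordinate {name!r}")
                elif not ref.equals(ds.coords[name]):
                    print(f"{path}: coordinate {name!r} differs")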
I've recently been trying to run open_mfdataset on a large list of large files. With more than ~100 files, the function became so slow that I gave up trying to run it. I then came upon this thread and discovered that if I passed the argument decode_cf=False, the function would run in a matter of seconds. Applying decode_cf to the returned dataset afterwards also ran in seconds, and I ended up with the same dataset from this two-step process as I did before.
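A sketch of that two-step process, assuming files matching *.nc:

import xarray as xr

# Step 1: skip CF decoding while combining; this avoids the expensive
# eager coordinate checks and runs in seconds
ds_raw = xr.open_mfdataset("*.nc", decode_cf=False)

# Step 2: decode the combined dataset once, after the fact
ds = xr.decode_cf(ds_raw)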
Would it be possible to change one of the following?
1. When decode_cf is called in open_mfdataset: open the individual datasets with decode_cf=False, then apply decode_cf to the merged dataset before it is returned.
2. Make decode_cf=False the default in open_mfdataset.
To me the first solution feels better, and I can make a pull request to do this.
From reading this thread I'm under the impression that there's probably something else going on under the hood that's causing the slowness of open_mfdataset at present. Obviously it would be best to address this; however, given that the problem was first raised in 2017 and a solution to the underlying problem doesn't seem to be forthcoming, I'd be very pleased to see a "fix" that addresses the symptom (the slowness) rather than the illness (whatever's going wrong behind the scenes).

An update on this long-standing issue.
I have learned that open_mfdataset can be blazingly fast with decode_cf=False but extremely slow with decode_cf=True.

As an example, I am loading a POP dataset on cheyenne; anyone with access can try it. This is roughly 45 years of daily data, one file per year.
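The fast call had roughly this form (a sketch; the path is hypothetical):

import xarray as xr

# Hypothetical glob for the POP daily output on cheyenne; the key
# detail is decode_cf=False
ds = xr.open_mfdataset("/glade/scratch/<user>/POP_daily/*.nc",
                       decode_cf=False)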
Instead, if I just change to decode_cf=True (the default), it takes forever, and I can watch what is happening on the distributed dashboard: there are more of these open_dataset tasks than there are files (45), so I can only presume there are 16401 individual tasks (one for each timestep), each taking about 1 s in serial.

This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372.
cc Pangeo folks: @jhamman, @mrocklin
@shoyer I will double check what the bottleneck is and report back.

@dcherian interestingly, the parallel option doesn't seem to do anything when decode_cf=True. From looking at the dask dashboard it seems to load each file sequentially, with each opening carried out by a different worker but not concurrently. I will see what I can do minimal-example-wise!

@rabernat wrote:
I seem to be experiencing a similar (same?) issue with open_dataset: https://stackoverflow.com/questions/71147712/can-i-force-xarray-open-dataset-to-do-a-lazy-load?stw=2
So is there any word on a best practice, fix, or workaround for the MFDataset performance? I am still getting abysmal reading performance with a list of netCDF files that represent sequential times. I want to use MFDataset to chunk multiple time steps into memory at once, but it's taking 5-10 minutes to construct MFDataset objects and even longer to run .values on them.