xarray: Slow performance of isel

Hi,

I get a very slow performance of Dataset.isel or DataArray.isel in comparison with the native numpy approach. Do you know where this comes from?

ds = xr.Dataset(
    {
        "a": ("time", np.arange(55_000_000))
    }, coords={
        "time": np.arange(55_000_000)
    }
)
time_filter = ds.time > 50_000

Select some values with DataArray.isel:

%timeit ds.a.isel(time=time_filter)

2.22 s ± 375 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Use the native numpy approach:

%timeit ds.a.values[time_filter]

163 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Linux OS-release: 3.16.0-4-amd64 machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.utf8 LOCALE: en_US.UTF-8

xarray: 0.10.4 pandas: 0.23.0 numpy: 1.14.2 scipy: 1.1.0 netCDF4: 1.4.0 h5netcdf: 0.5.1 h5py: 2.8.0 Nio: None zarr: None bottleneck: 1.2.1 cyordereddict: None dask: 0.17.5 distributed: 1.21.8 matplotlib: 2.2.2 cartopy: 0.16.0 seaborn: 0.8.1 setuptools: 39.1.0 pip: 9.0.3 conda: None pytest: 3.5.1 IPython: 6.4.0 sphinx: 1.7.4

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Comments: 28 (23 by maintainers)

Commits related to this issue

Most upvoted comments

I don’t know much about indexing but that PR propagates a “new” indexes property as part of #1603 (work towards enabling more flexible indexing), it doesn’t change anything about “indexing”. I think the dask docs may be more relevant to what you may be asking about: https://docs.dask.org/en/latest/array-slicing.html

My measurements:

>>> %timeit ds.a.isel(time=time_filter)
1 loop, best of 3: 906 ms per loop
>>> %timeit ds.a.isel(time=time_filter.values)
1 loop, best of 3: 447 ms per loop
>>> %timeit ds.a.values[time_filter]
10 loops, best of 3: 169 ms per loop

Given the size of this gap, I suspect this could be improved with some investigation and profiling, but there is certainly an upper-limit on the possible performance gain.

One simple example is that indexing the dataset needs to index both 'a' and 'time', so it’s going to be at least twice as slow as only indexing 'a'. So the second indexing expression ds.a.isel(time=time_filter.values) is only 447/(169*2) = 1.32 times slower than the best case scenario.

Another part of the matrix of possibilities. Takes about half the time if you pass time_filter.values (numpy array) rather than the time_filter DataArray:

%timeit ds.a.isel(time=time_filter.values)
1.3 s ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I don’t have experience using isel with boolean indexing. (Although the docs on positional indexing claim it is supported.) My guess is that that the time is being spent aligning the indexer with the array, which is unnecessary since you know they are already aligned. Probably not the most efficient pattern for xarray.

Here’s how I would recommend writing the query using label-based selection:

%timeit ds.a.sel(time=slice(50_001, None))
117 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)