xarray: Slow performance of isel
Hi,
I see very slow performance from Dataset.isel and DataArray.isel compared with plain numpy indexing. Do you know where this comes from?
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"a": ("time", np.arange(55_000_000))},
    coords={"time": np.arange(55_000_000)},
)
time_filter = ds.time > 50_000
Select some values with DataArray.isel:
%timeit ds.a.isel(time=time_filter)
2.22 s ± 375 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use the native numpy approach:
%timeit ds.a.values[time_filter]
163 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8
xarray: 0.10.4
pandas: 0.23.0
numpy: 1.14.2
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.5.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.5
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.8.1
setuptools: 39.1.0
pip: 9.0.3
conda: None
pytest: 3.5.1
IPython: 6.4.0
sphinx: 1.7.4
About this issue
- State: open
- Created 6 years ago
- Comments: 28 (23 by maintainers)
Commits related to this issue
- Propagate indexes in DataArray binary operations. Works by propagating indexes in DataArray._replace. xref #2227. Tests pass! — committed to dcherian/xarray by dcherian 5 years ago
- Propagate indexes in DataArray binary operations. (#3481) * Propagate indexes in DataArray binary operations. Works by propagating indexes in DataArray._replace. xref #2227. Tests pass! * re... — committed to pydata/xarray by dcherian 5 years ago
I don’t know much about indexing, but that PR propagates a “new” indexes property as part of #1603 (work towards enabling more flexible indexing); it doesn’t change anything about “indexing”. I think the dask docs may be more relevant to what you may be asking about: https://docs.dask.org/en/latest/array-slicing.html
My measurements:
Given the size of this gap, I suspect this could be improved with some investigation and profiling, but there is certainly an upper-limit on the possible performance gain.
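For anyone who wants to reproduce this kind of comparison outside IPython, here is a minimal sketch using the standard-library timeit module. The array size is deliberately smaller than the 55_000_000 elements in the report, and the labels and the candidates dictionary are illustrative names, not something from the thread. Absolute numbers will depend on the machine and the xarray version.

```python
import timeit

import numpy as np
import xarray as xr

# Assumption: a smaller array than in the report so the sketch runs quickly.
n = 1_000_000
ds = xr.Dataset({"a": ("time", np.arange(n))}, coords={"time": np.arange(n)})
time_filter = ds.time > 50_000

# The three variants discussed in this thread.
candidates = {
    "isel, DataArray indexer": lambda: ds.a.isel(time=time_filter),
    "isel, numpy indexer": lambda: ds.a.isel(time=time_filter.values),
    "plain numpy": lambda: ds.a.values[time_filter.values],
}

timings = {}
for label, fn in candidates.items():
    # Best of three repeats, one call each, in seconds.
    timings[label] = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{label}: {timings[label] * 1e3:.1f} ms")
```

Taking the minimum over repeats is one common way to reduce noise; %timeit reports a mean and standard deviation instead, so the figures are not directly comparable.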
One simple example is that indexing the dataset needs to index both 'a' and 'time', so it’s going to be at least twice as slow as only indexing 'a'. So the second indexing expression, ds.a.isel(time=time_filter.values), is only 447/(169*2) = 1.32 times slower than the best-case scenario.
Another part of the matrix of possibilities: it takes about half the time if you pass time_filter.values (a numpy array) rather than the time_filter DataArray:
I don’t have experience using isel with boolean indexing (although the docs on positional indexing claim it is supported). My guess is that the time is being spent aligning the indexer with the array, which is unnecessary since you know they are already aligned. It is probably not the most efficient pattern for xarray.
Here’s how I would recommend writing the query using label-based selection:
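The snippets from these comments are not preserved here, so what follows is only a sketch of what the two suggestions might look like, not the commenters' exact code. It assumes the same dataset layout as in the report (scaled down), and it assumes that the strict filter time > 50_000 on integer labels can be written as a label slice starting at 50_001, since label slices in sel include both endpoints.

```python
import numpy as np
import xarray as xr

# Assumption: a small stand-in for the 55_000_000 elements in the report.
n = 100_000
ds = xr.Dataset({"a": ("time", np.arange(n))}, coords={"time": np.arange(n)})
time_filter = ds.time > 50_000

# Positional selection with a plain numpy boolean mask instead of a DataArray.
by_mask = ds.a.isel(time=time_filter.values)

# Label-based selection: the slice bounds are labels, and both ends are inclusive,
# so 50_001 is the first label that satisfies "> 50_000".
by_label = ds.a.sel(time=slice(50_001, None))

# Both routes select the same elements and coordinates.
assert by_mask.equals(by_label)
```

The slice form only applies when the condition maps to a contiguous range of a sorted index; an arbitrary boolean condition would still need a mask.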