xarray: Aggregating a dimension using the quantile method with `skipna=True` is very slow
What happened?
Hi all,
as the title already summarizes, I’m running into performance issues when aggregating over the time dimension of a 3D DataArray using the quantile method with `skipna=True`.
See the section below for some dummy data that represents what I’m working with (e.g., similar to this). Aggregating over the time dimension of this dummy data, I get the following wall times:
| # | Call | Wall time |
|---|------|-----------|
| 1 | `da.median(dim='time', skipna=True)` | 1.35 s |
| 2 | `da.quantile(0.95, dim='time', skipna=False)` | 5.95 s |
| 3 | `da.quantile(0.95, dim='time', skipna=True)` | 6 min 6 s |
I’m currently using a compute node with 40 CPUs and 180 GB RAM. Here is what the resource utilization looks like: the first small bump corresponds to calls 1 and 2, while the second, longer peak is call 3.
In this small example, the process at least finishes after a few minutes. With my actual dataset the quantile calculation takes hours…
I suspect the following NumPy issue is relevant and should be revived: https://github.com/numpy/numpy/issues/16575 (see the timing sketch below).
Are there any possible workarounds?
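For reference, the slowdown can be reproduced with plain NumPy, without xarray in the picture (a minimal sketch, using smaller arrays than the MVCE so it finishes in a more reasonable time):

```python
import timeit
import numpy as np

arr = np.random.rand(20, 500, 500)
arr[arr < 0.2] = np.nan  # roughly 20% NaNs

# xarray dispatches to np.nanquantile when skipna=True; the NumPy issue
# linked above tracks why it is so much slower than np.quantile.
print(timeit.timeit(lambda: np.quantile(arr, 0.95, axis=0), number=1))
print(timeit.timeit(lambda: np.nanquantile(arr, 0.95, axis=0), number=1))
```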
What did you expect to happen?
No response
Minimal Complete Verifiable Example
```python
import pandas as pd
import numpy as np
import xarray as xr

# Create dummy data and randomly replace a subset of values with NaN
# (n_nan corresponds to 20% of one spatial slice, i.e. ~1% of all values)
size_spatial = 2000
size_temporal = 20
n_nan = int(size_spatial**2 * 0.2)

time = pd.date_range("2000-01-01", periods=size_temporal)
lat = np.random.uniform(low=-90, high=90, size=size_spatial)
lon = np.random.uniform(low=-180, high=180, size=size_spatial)
data = np.random.rand(size_temporal, size_spatial, size_spatial)
index_nan = np.random.choice(data.size, n_nan, replace=False)
data.ravel()[index_nan] = np.nan

# Create DataArray
da = xr.DataArray(data=data,
                  dims=['time', 'x', 'y'],
                  coords={'time': time, 'x': lon, 'y': lat},
                  attrs={'nodata': np.nan})

# Calculate 95th quantile over the time dimension
da.quantile(0.95, dim='time', skipna=True)
```
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
No response
Anything else we need to know?
No response
Environment
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-125-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0

xarray: 2022.12.0
pandas: 1.5.0
numpy: 1.23.3
scipy: 1.9.1
netCDF4: 1.6.1
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.3
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.10.0
distributed: 2022.10.0
matplotlib: 3.6.1
cartopy: 0.21.0
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.3
conda: 4.12.0
pytest: None
mypy: None
IPython: 8.5.0
sphinx: None
```
About this issue
- State: closed
- Created 2 years ago
- Comments: 17 (10 by maintainers)
AFAIK, performance-increasing backends such as `numbagg` and `bottleneck` are optional dependencies, so by default `numpy.nanquantile` will be used. In cases where users have `numbagg` installed but want to disable it, they could do so using `xarray.set_options`:
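For example (a minimal sketch, assuming the `da` from the MVCE above; note that the `use_numbagg` option only exists in xarray releases newer than the 2022.12.0 listed in the environment):

```python
import xarray as xr

# Disable the optional acceleration backends for everything in this block.
# `use_bottleneck` has been available for a while; `use_numbagg` is newer
# and not available in xarray 2022.12.0.
with xr.set_options(use_bottleneck=False, use_numbagg=False):
    result = da.quantile(0.95, dim='time', skipna=True)
```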
Here is a test with a real-world example (a 5-year time series of Sentinel-1 RTC data):
This is the numbagg wrapper I’ve used (same as I’ve posted above).
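The wrapper itself is not reproduced here; the following is a minimal sketch of what an `apply_ufunc`-based version might look like (the exact `numbagg.nanquantile` signature — array, quantiles, axis — is an assumption):

```python
import numbagg
import xarray as xr

def nanquantile_numbagg(da, q, dim):
    # Make `dim` a core dimension (moved to the last axis) and reduce it
    # with numbagg's compiled nanquantile.
    return xr.apply_ufunc(
        lambda arr: numbagg.nanquantile(arr, q, axis=-1),
        da,
        input_core_dims=[[dim]],
        dask="parallelized",
        output_dtypes=[da.dtype],
    )

# Usage with the dummy data from the MVCE:
# result = nanquantile_numbagg(da, q=0.95, dim='time')
```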
Yeah, having it run in parallel is going to generally be much faster. And avoiding multiple processes / dask is going to generally save on overhead.
A welcome PR would be to add numbagg’s `nanpercentile` to xarray, in the same way that rolling methods run through numbagg by default if it’s installed. (Or, if we had benchmarks showing another approach was more performant, happy to consider that.)
Recently, I created a library for the same purpose; maybe it can help other people. It can speed up quantile computations a lot. To install, just run:

```
pip install fastnanquantile
```

It’s compatible with dask. More info: https://github.com/lbferreira/fastnanquantile. Example below:
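The original example is not preserved here; this is a hypothetical usage sketch (the `nanquantile` function name and its keyword arguments are assumptions based on the README — see the link above for the actual API):

```python
import numpy as np
import fastnanquantile as fnq

data = np.random.rand(20, 500, 500)
data[data < 0.2] = np.nan  # roughly 20% NaNs

# 95th percentile over the first (time) axis, skipping NaNs.
result = fnq.nanquantile(data, q=0.95, axis=0)
```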
Hi all, I just created a simple workaround, which might be useful for others:
https://gist.github.com/maawoo/0b34d371c3cc1960a1589ccaded868c2
It uses the `_nan_quantile` method of xclim and works fine for my applications. Here is a quick comparison using the same example data as in my initial post.

EDIT: I’ve updated the code to use numbagg instead of xclim.
Thanks @arongergely! I have mentioned the numpy issue in my post above (FYI, for anyone looking for it). I was really surprised to see that it’s over two years old and that this is now the first xarray issue referencing it. If it’s really a “well known” issue, I think it should be mentioned somewhere in the documentation of xarray’s quantile method.
I have seen the blog post and tried to use the workaround with apply_ufunc and Dask but ran into some problems. I’ll revisit that when I have some time and will also check xclim. Seems to be very promising, thanks!
Happy holidays! 🎄