xarray: Aggregating a dimension using the quantile method with `skipna=True` is very slow
What happened?
Hi all,
as the title already summarizes, I’m running into performance issues when aggregating over the time dimension of a 3D DataArray using the quantile method with `skipna=True`.
See the section below for some dummy data that represents what I’m working with (e.g., similar to this). Aggregating over the time dimension of this dummy data, I get the following wall times:
| # | Call | Wall time |
|---|------|-----------|
| 1 | `da.median(dim='time', skipna=True)` | 1.35 s |
| 2 | `da.quantile(0.95, dim='time', skipna=False)` | 5.95 s |
| 3 | `da.quantile(0.95, dim='time', skipna=True)` | 6 min 6 s |
I’m currently using a compute node with 40 CPUs and 180 GB RAM. Here is what the resource utilization looks like: the first small bump corresponds to calls 1 and 2, while the second, longer peak is call 3.
In this small example, the process at least finishes after a few minutes. With my actual dataset the quantile calculation takes hours…
I suspect the following NumPy issue is relevant and should be revived: https://github.com/numpy/numpy/issues/16575 (see the timing sketch below).
Are there any possible workarounds?
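For reference, the slowdown can be reproduced with plain NumPy, without xarray in the picture (a minimal sketch, using smaller arrays than the MVCE so it finishes in a more reasonable time):

```python
import timeit
import numpy as np

arr = np.random.rand(20, 500, 500)
arr[arr < 0.2] = np.nan  # roughly 20% NaNs

# xarray dispatches to np.nanquantile when skipna=True; the NumPy issue
# linked above tracks why it is so much slower than np.quantile.
print(timeit.timeit(lambda: np.quantile(arr, 0.95, axis=0), number=1))
print(timeit.timeit(lambda: np.nanquantile(arr, 0.95, axis=0), number=1))
```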
What did you expect to happen?
No response
Minimal Complete Verifiable Example
```python
import pandas as pd
import numpy as np
import xarray as xr

# Create dummy data and randomly replace a subset of values with NaN
# (n_nan corresponds to 20% of one spatial slice, i.e. ~1% of all values)
size_spatial = 2000
size_temporal = 20
n_nan = int(size_spatial**2 * 0.2)

time = pd.date_range("2000-01-01", periods=size_temporal)
lat = np.random.uniform(low=-90, high=90, size=size_spatial)
lon = np.random.uniform(low=-180, high=180, size=size_spatial)
data = np.random.rand(size_temporal, size_spatial, size_spatial)
index_nan = np.random.choice(data.size, n_nan, replace=False)
data.ravel()[index_nan] = np.nan

# Create DataArray
da = xr.DataArray(data=data,
                  dims=['time', 'x', 'y'],
                  coords={'time': time, 'x': lon, 'y': lat},
                  attrs={'nodata': np.nan})

# Calculate 95th quantile over the time dimension
da.quantile(0.95, dim='time', skipna=True)
```
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
No response
Anything else we need to know?
No response
Environment
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-125-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0

xarray: 2022.12.0
pandas: 1.5.0
numpy: 1.23.3
scipy: 1.9.1
netCDF4: 1.6.1
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.3
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.10.0
distributed: 2022.10.0
matplotlib: 3.6.1
cartopy: 0.21.0
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.3
conda: 4.12.0
pytest: None
mypy: None
IPython: 8.5.0
sphinx: None
```
About this issue
- State: closed
- Created 2 years ago
- Comments: 17 (10 by maintainers)
AFAIK, performance-increasing backends such as `numbagg` and `bottleneck` are optional dependencies, so by default `numpy.nanquantile` will be used. In cases where users have `numbagg` installed but want to disable it, they could do so using `xarray.set_options`:
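For example (a minimal sketch, assuming the `da` from the MVCE above; note that the `use_numbagg` option only exists in xarray releases newer than the 2022.12.0 listed in the environment):

```python
import xarray as xr

# Disable the optional acceleration backends for everything in this block.
# `use_bottleneck` has been available for a while; `use_numbagg` is newer
# and not available in xarray 2022.12.0.
with xr.set_options(use_bottleneck=False, use_numbagg=False):
    result = da.quantile(0.95, dim='time', skipna=True)
```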
Here is a test with a real-world example (a 5-year time series of Sentinel-1 RTC data):
This is the numbagg wrapper I’ve used (same as I’ve posted above).
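The wrapper itself is not reproduced here; the following is a minimal sketch of what an `apply_ufunc`-based version might look like (the exact `numbagg.nanquantile` signature — array, quantiles, axis — is an assumption):

```python
import numbagg
import xarray as xr

def nanquantile_numbagg(da, q, dim):
    # Make `dim` a core dimension (moved to the last axis) and reduce it
    # with numbagg's compiled nanquantile.
    return xr.apply_ufunc(
        lambda arr: numbagg.nanquantile(arr, q, axis=-1),
        da,
        input_core_dims=[[dim]],
        dask="parallelized",
        output_dtypes=[da.dtype],
    )

# Usage with the dummy data from the MVCE:
# result = nanquantile_numbagg(da, q=0.95, dim='time')
```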
Yeah, having it run in parallel is going to generally be much faster. And avoiding multiple processes / dask is going to generally save on overhead.
A welcome PR would be to add numbagg’s `nanpercentile` to xarray, in the same way that rolling methods run through numbagg by default if it’s installed. (Or, if we had benchmarks showing another approach was more performant, happy to consider that.)
Recently, I created a library for the same purpose; maybe it can help other people. It can speed up quantile computations a lot. To install, just run:

```
pip install fastnanquantile
```

It’s compatible with dask. More info: https://github.com/lbferreira/fastnanquantile. Example below:
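The original example is not preserved here; this is a hypothetical usage sketch (the `nanquantile` function name and its keyword arguments are assumptions based on the README — see the link above for the actual API):

```python
import numpy as np
import fastnanquantile as fnq

data = np.random.rand(20, 500, 500)
data[data < 0.2] = np.nan  # roughly 20% NaNs

# 95th percentile over the first (time) axis, skipping NaNs.
result = fnq.nanquantile(data, q=0.95, axis=0)
```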
Hi all, I just created a simple workaround, which might be useful for others:
https://gist.github.com/maawoo/0b34d371c3cc1960a1589ccaded868c2
It uses the `_nan_quantile` method of xclim and works fine for my applications. Here is a quick comparison using the same example data as in my initial post.

EDIT: I’ve updated the code to use numbagg instead of xclim.
Thanks @arongergely! I have mentioned the numpy issue in my post above (FYI, for anyone looking for it). I was really surprised to see that it’s over two years old and that this is now the first xarray issue referencing it. If it’s really a “well known” issue, I think it should be mentioned somewhere in the documentation of xarray’s quantile method.
I have seen the blog post and tried to use the workaround with apply_ufunc and Dask but ran into some problems. I’ll revisit that when I have some time and will also check xclim. Seems to be very promising, thanks!
Happy holidays! 🎄