xclim: Standardised Precipitation Index always throws error

Setup Information

  • Xclim version: 0.39.0
  • Python version: 3.8.10
  • Operating System: Linux

Description

I’m attempting to calculate the Standardised Precipitation Index using Xclim with two data arrays (a reference period and a projection). Even when specifying the following chunks:

import xarray as xr

hist = xr.open_dataarray("./historic.nc", chunks={"time": -1, "lat": 50, "lon": 50})
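
For context, the indicator call looks roughly like this (the indicator path and keyword values shown are illustrative defaults from the 0.39 documentation, not necessarily the exact ones in my script):

import xclim

proj = xr.open_dataarray("./projection.nc", chunks={"time": -1, "lat": 50, "lon": 50})

# Sketch of the call that triggers the warning; the exact keyword values
# do not matter for reproducing it.
spi = xclim.atmos.standardized_precipitation_index(
    pr=proj, pr_cal=hist, freq="MS", window=1, dist="gamma", method="APP"
)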

The SPI indicator function always throws the following error:

"The input data is chunked on time dimension and must be fully rechunked to"
                    " run `fit` on groups ."
                    " Beware, this operation can significantly increase the number of tasks dask"
                    " has to handle.",

From a quick look at the source code, this happens because xarray's resample does not preserve time-based chunking: after resampling, the data is chunked as time: 1, lat: XX, lon: XX. As a result, the following code (from the source) always triggers the warning, which is extremely confusing (a short demonstration of the resampling behaviour follows the excerpt below).

   if freq != "D":
        pr = pr.resample(time=freq).mean(keep_attrs=True)
        pr_cal = pr_cal.resample(time=freq).mean(keep_attrs=True)

        def needs_rechunking(da):
            if uses_dask(da) and len(da.chunks[da.get_axis_num("time")]) > 1:
                warnings.warn(
                    "The input data is chunked on time dimension and must be fully rechunked to"
                    " run `fit` on groups ."
                    " Beware, this operation can significantly increase the number of tasks dask"
                    " has to handle.",
                    stacklevel=2,
                )
                return True
            return False

        if needs_rechunking(pr):
            pr = pr.chunk({"time": -1})
        if needs_rechunking(pr_cal):
            pr_cal = pr_cal.chunk({"time": -1})
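
To illustrate the resampling behaviour, here is a minimal, self-contained sketch with toy data (not my actual files); on the xarray version reported here, the monthly mean comes back with many small chunks along time even though the input had a single one:

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
pr = xr.DataArray(
    np.random.rand(time.size, 4, 4),
    coords={"time": time},
    dims=("time", "lat", "lon"),
    name="pr",
).chunk({"time": -1, "lat": 2, "lon": 2})

# Single chunk along time before resampling.
print(len(pr.chunks[pr.get_axis_num("time")]))

# After the monthly resample, the time axis is split into many small chunks,
# which is what makes needs_rechunking() return True above.
monthly = pr.resample(time="MS").mean(keep_attrs=True)
print(len(monthly.chunks[monthly.get_axis_num("time")]))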

This also leads to the following xarray warning:

/home/ubuntu/.local/lib/python3.8/site-packages/xarray/core/indexing.py:1374: PerformanceWarning: Slicing with an out-of-order index is generating 95 times more chunks
  return self.array[key]

This could also explain the large number of dask tasks generated by the SPI indicator function: multiple rounds of rechunking (one done by the user and one internally) can drastically increase the number of tasks and the amount of data moved around in RAM.
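
A rough way to see this is to compare the size of the dask graph before and after the extra full rechunk (continuing from the toy example above; the numbers will vary with the chunking):

monthly = pr.resample(time="MS").mean(keep_attrs=True)

# Tasks needed for the resample alone vs. after the additional full rechunk
# on time that the indicator performs internally.
print(len(dict(monthly.__dask_graph__())))
print(len(dict(monthly.chunk({"time": -1}).__dask_graph__())))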

Something to look out for.

Let me know if there’s something wrong with what I’m saying here, happy to discuss.

Thanks!

Steps To Reproduce

No response

Additional context

No response

Contribution

  • I would be willing/able to open a Pull Request to address this bug.

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (5 by maintainers)


Most upvoted comments

In my use case, another possible performance improvement would be the ability to pre-compute the fitting parameters and then pass them to the SPI function, similar to what is possible in the climate_indices package (see the generic sketch below).

EDIT: I did a bit more digging in the readthedocs and found a code snippet that is very helpful; it would be amazing if every function had one: https://xclim.readthedocs.io/en/stable/xclim.indices.html#xclim.indices._agro.standardized_precipitation_index. It helped me understand the code better and how the sampling is used; I can now improve the prediction by using pr_cal correctly.
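
For clarity, what I mean by pre-computed parameters is something like the following library-agnostic sketch using plain scipy (not xclim code; grouping by calendar month and the zero-precipitation case are omitted):

import numpy as np
import scipy.stats as sps

# Stand-in samples for one grid cell / calendar month.
cal_values = np.random.gamma(2.0, 3.0, size=500)     # calibration period
proj_values = np.random.gamma(2.0, 3.0, size=1000)   # projection period

# Fit the gamma distribution once on the calibration data...
shape, loc, scale = sps.gamma.fit(cal_values, floc=0)

# ...then reuse the parameters to transform any projection values to SPI
# (gamma CDF mapped through the standard normal quantile function).
spi = sps.norm.ppf(sps.gamma.cdf(proj_values, shape, loc=loc, scale=scale))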

Following the discussion on large datasets: I only discovered xclim this week and have been working through the readthedocs (fair point, the package is new, but the docstrings could be improved). The reason I am commenting here is the SPI calculation run time and GrahamReveley's comments about the huge number of tasks: are there any benchmarks on run times? I am running the following dataset locally, on average hardware: xarray.DataArray 'pr' with time: 7670, y: 125, x: 185 (daily data between 1985 and 2005). I don't have pr_cal data, so I am feeding the same pr as pr_cal (the documentation is not clear on this point), and I am now wondering whether I should break this up into smaller arrays, say 25x25. Any optimisation strategy is welcome, as I would also like to run future projections (time: 35000).
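
In case it helps, this is roughly what my call looks like (going by the xclim.indices docstring linked above; the dimension names and keyword values are just what I happen to use):

import xarray as xr
from xclim.indices import standardized_precipitation_index

pr = xr.open_dataarray("pr.nc", chunks={"time": -1, "y": 25, "x": 25})

# No separate calibration dataset, so pr itself is used as pr_cal.
spi3 = standardized_precipitation_index(
    pr,
    pr_cal=pr,  # a sub-period could also be selected, e.g. pr.sel(time=slice("1985", "2005"))
    freq="MS",
    window=3,
    dist="gamma",
    method="APP",
)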

I have been looking for a package like this for some time, and it looks great, so happy to contribute (by testing) and collaborate where possible.