xarray: why time grouping doesn't preserve chunks
Code Sample, a copy-pastable example if possible
I am continuing my quest to obtain more efficient time grouping for calculation of climatologies and climatological anomalies. I believe this is one of the major performance bottlenecks facing xarray users today. I have raised this in other issues (e.g. #1832), but I believe I have narrowed it down here to a more specific problem.
The easiest way to summarize the problem is with an example. Consider the following dataset
import xarray as xr
ds = xr.Dataset({'foo': (['x'], [1, 1, 1, 1])},
                coords={'x': (['x'], [0, 1, 2, 3]),
                        'bar': (['x'], ['a', 'a', 'b', 'b']),
                        'baz': (['x'], ['a', 'b', 'a', 'b'])})
ds = ds.chunk({'x': 2})
ds
<xarray.Dataset>
Dimensions: (x: 4)
Coordinates:
* x (x) int64 0 1 2 3
bar (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
baz (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
Data variables:
foo (x) int64 dask.array<shape=(4,), chunksize=(2,)>
One non-dimension coordinate (bar) is contiguous with respect to x, while the other (baz) is not. This is important: baz is structured similarly to the way that month would be distributed on a timeseries dataset.
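To make the analogy concrete, here is a small sketch (synthetic data, not part of the original report) of how month repeats along a monthly time axis, just as baz repeats along x:

import pandas as pd
import xarray as xr

# illustrative three-year monthly time series (names chosen for this sketch only)
times = pd.date_range('2000-01-01', periods=36, freq='MS')
ts = xr.DataArray(range(36), dims='time', coords={'time': times})
print(ts['time'].dt.month.values[:14])
# [ 1  2  3  4  5  6  7  8  9 10 11 12  1  2]  -> the same month recurs every 12 steps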
Now let’s do a trivial groupby operation on bar that does nothing, just returns the group unchanged:
ds.foo.groupby('bar').apply(lambda x: x)
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(2,)>
Coordinates:
* x (x) int64 0 1 2 3
bar (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
baz (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
This operation preserved the original chunks in foo. But if we group by baz, we see something different:
ds.foo.groupby('baz').apply(lambda x: x)
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(4,)>
Coordinates:
* x (x) int64 0 1 2 3
bar (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
baz (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
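For reference, the chunk structure can also be inspected directly; this snippet is not from the original report, and the commented values simply restate what the reprs above show (xarray 0.10.7 era):

# same ds as defined above
print(ds.foo.groupby('bar').apply(lambda x: x).data.chunks)  # ((2, 2),) - chunks preserved
print(ds.foo.groupby('baz').apply(lambda x: x).data.chunks)  # ((4,),)  - collapsed into a single chunk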
Problem description
When grouping over a non-contiguous variable (baz), the chunk structure is lost: the result comes back as a single chunk spanning the whole array. That means that we can’t lazily access a single item without computing the whole array. This has major performance consequences that make it hard to calculate anomaly values in a more realistic case. What we really want to do is often something like
ds = xr.open_mfdataset('lots/of/files/*.nc')
ds_anom = ds.groupby('time.month').apply(lambda x: x - x.mean(dim='time'))
It is currently impossible to do this lazily due to the issue described above.
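As a self-contained way to reproduce the behavior without any files, here is a sketch with synthetic data standing in for the multi-file dataset (all names here are illustrative):

import pandas as pd
import xarray as xr

# two years of monthly data, chunked along time
times = pd.date_range('2000-01-01', periods=24, freq='MS')
ds_synth = xr.Dataset({'foo': ('time', range(24))}, coords={'time': times}).chunk({'time': 6})

ds_anom = ds_synth.groupby('time.month').apply(lambda x: x - x.mean(dim='time'))
print(ds_anom.foo.data.chunks)  # at the time of this issue: one big chunk, not ((6, 6, 6, 6),)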
Expected Output
We would like to preserve the original chunk structure of foo.
Output of xr.show_versions()
xr.show_versions() is triggering a segfault right now on my system for unknown reasons! I am using xarray 0.10.7.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 30 (30 by maintainers)
Links to this issue
Commits related to this issue
- Enable `flox` in `GroupBy` and `resample` (#5734) Closes #5734 Closes #4473 Closes #4498 Closes #659 Closes #2237 xref https://github.com/pangeo-data/pangeo/issues/271 Co-authored-by: Anderson Banih... — committed to dcherian/xarray by dcherian 2 years ago
And just because it’s fun, I will show what the anomaly calculation looks like:

ds.foo.groupby('bar').apply(lambda x: x - x.mean()).data.visualize()

(task graph image)

ds.foo.groupby('baz').apply(lambda x: x - x.mean()).data.visualize()

(task graph image)

It looks like everything is really ok up until the very end, where all the tasks aggregate into a single getitem call.

I’m glad to see that this has generated so much serious discussion and thought! I will try to catch up on it in the morning when I have some hope of understanding.
With groupby in xarray, we have two main cases:

1. Reductions over the group (e.g. ds.groupby('baz').mean(dim='x')). There is currently no problem here. The new dimension becomes baz and the array is chunked as {'baz': 1}.
2. Transformations that keep the original dimension (e.g. ds.groupby('baz').apply(lambda x: x - x.mean())). In this case, the point of the out-of-order indexing is actually to put the array back together in its original order. In my last example above, according to the dot graph, it looks like there are four chunks right up until the end. They just have to be re-ordered. I imagine this should be cheap and simple, but I am probably overlooking something.

Case 2 seems similar to @shoyer’s example: x[np.arange(4)[::-1]]. Here we would just want to reorder the existing chunks.

If the chunk size before reindexing is not 1, then yes, one needs to do something more sophisticated. But I would argue that, if the array is being re-indexed along a dimension in which the chunk size is 1, a sensible default behavior would be to avoid aggregating into a big chunk and instead just pass the original chunks through in a new order.
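To illustrate that last point, here is a minimal sketch using dask directly (today's dask API, not how xarray or dask actually handle this case): when the chunk size along the indexed axis is 1, out-of-order indexing is just a permutation of blocks, so the chunk structure could in principle be preserved.

import numpy as np
import dask.array as da

x = da.arange(4, chunks=1)            # four chunks of size 1
order = np.array([0, 2, 1, 3])        # a 'baz'-style permutation back to the original order

# reorder the existing blocks instead of funnelling everything through one big getitem
reordered = da.concatenate([x.blocks[int(i)] for i in order])
print(reordered.chunks)               # ((1, 1, 1, 1),) -- original chunk structure preserved
print(reordered.compute())            # [0 2 1 3]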
I think that it would be useful to consider many possible cases of how people might want to chunk dask arrays with out-of-order indices, and the desired chunking outputs. XArray users like those here can provide some of those use cases. We’ll have to gather others from other communities. Maybe once we have gathered enough use cases, rules for what the correct behavior should be will emerge?
Thanks for the explanation @shoyer! Yes, that appears to be the root of the issue. After literally years of struggling with this, I am happy to finally get to this level of clarity.
Do we think dask is happy with that behavior? If not, then an upstream fix would be best. Pinging @mrocklin.
Otherwise we can try to work around it in xarray.