xarray: why time grouping doesn't preserve chunks
Code Sample, a copy-pastable example if possible
I am continuing my quest to obtain more efficient time grouping for calculation of climatologies and climatological anomalies. I believe this is one of the major performance bottlenecks facing xarray users today. I have raised this in other issues (e.g. #1832), but I believe I have narrowed it down here to a more specific problem.
The easiest way to summarize the problem is with an example. Consider the following dataset
import xarray as xr
ds = xr.Dataset({'foo': (['x'], [1, 1, 1, 1])},
                coords={'x': (['x'], [0, 1, 2, 3]),
                        'bar': (['x'], ['a', 'a', 'b', 'b']),
                        'baz': (['x'], ['a', 'b', 'a', 'b'])})
ds = ds.chunk({'x': 2})
ds
<xarray.Dataset>
Dimensions: (x: 4)
Coordinates:
* x (x) int64 0 1 2 3
bar (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
baz (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
Data variables:
foo (x) int64 dask.array<shape=(4,), chunksize=(2,)>
One non-dimension coordinate (bar) is contiguous with respect to x, while the other (baz) is not. This is important: baz is structured similarly to the way that month would be distributed on a timeseries dataset.
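To make the analogy concrete, here is a small sketch (synthetic data, not part of the original report) of how month repeats along a monthly time axis, just as baz repeats along x:

import pandas as pd
import xarray as xr

# illustrative three-year monthly time series (names chosen for this sketch only)
times = pd.date_range('2000-01-01', periods=36, freq='MS')
ts = xr.DataArray(range(36), dims='time', coords={'time': times})
print(ts['time'].dt.month.values[:14])
# [ 1  2  3  4  5  6  7  8  9 10 11 12  1  2]  -> the same month recurs every 12 steps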
Now let’s do a trivial groupby operation on bar that does nothing, just returns the group unchanged:
ds.foo.groupby('bar').apply(lambda x: x)
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(2,)>
Coordinates:
* x (x) int64 0 1 2 3
bar (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
baz (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
This operation preserved the original chunks in foo. But if we group by baz, we see something different:
ds.foo.groupby('baz').apply(lambda x: x)
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(4,)>
Coordinates:
* x (x) int64 0 1 2 3
bar (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
baz (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
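For reference, the chunk structure can also be inspected directly; this snippet is not from the original report, and the commented values simply restate what the reprs above show (xarray 0.10.7 era):

# same ds as defined above
print(ds.foo.groupby('bar').apply(lambda x: x).data.chunks)  # ((2, 2),) - chunks preserved
print(ds.foo.groupby('baz').apply(lambda x: x).data.chunks)  # ((4,),)  - collapsed into a single chunk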
Problem description
When grouping over a non-contiguous variable (baz), the chunk structure is lost: the result comes back as a single chunk spanning the whole array. That means that we can’t lazily access a single item without computing the whole array. This has major performance consequences that make it hard to calculate anomaly values in a more realistic case. What we really want to do is often something like
ds = xr.open_mfdataset('lots/of/files/*.nc')
ds_anom = ds.groupby('time.month').apply(lambda x: x - x.mean(dim='time'))
It is currently impossible to do this lazily due to the issue described above.
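As a self-contained way to reproduce the behavior without any files, here is a sketch with synthetic data standing in for the multi-file dataset (all names here are illustrative):

import pandas as pd
import xarray as xr

# two years of monthly data, chunked along time
times = pd.date_range('2000-01-01', periods=24, freq='MS')
ds_synth = xr.Dataset({'foo': ('time', range(24))}, coords={'time': times}).chunk({'time': 6})

ds_anom = ds_synth.groupby('time.month').apply(lambda x: x - x.mean(dim='time'))
print(ds_anom.foo.data.chunks)  # at the time of this issue: one big chunk, not ((6, 6, 6, 6),)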
Expected Output
We would like to preserve the original chunk structure of foo.
Output of xr.show_versions()
xr.show_versions() is triggering a segfault right now on my system for unknown reasons! I am using xarray 0.10.7.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 30 (30 by maintainers)
Links to this issue
Commits related to this issue
- Enable `flox` in `GroupBy` and `resample` (#5734) Closes #5734 Closes #4473 Closes #4498 Closes #659 Closes #2237 xref https://github.com/pangeo-data/pangeo/issues/271 Co-authored-by: Anderson Banih... — committed to dcherian/xarray by dcherian 2 years ago
And just because it’s fun, I will show what the anomaly calculation looks like:

ds.foo.groupby('bar').apply(lambda x: x - x.mean()).data.visualize()

(task graph image)

ds.foo.groupby('baz').apply(lambda x: x - x.mean()).data.visualize()

(task graph image)

It looks like everything is really ok up until the very end, where all the tasks aggregate into a single getitem call.

I’m glad to see that this has generated so much serious discussion and thought! I will try to catch up on it in the morning when I have some hope of understanding.
With groupby in xarray, we have two main cases:

1. Reductions over the group (e.g. ds.groupby('baz').mean(dim='x')). There is currently no problem here. The new dimension becomes baz and the array is chunked as {'baz': 1}.
2. Transformations that keep the original dimension (e.g. ds.groupby('baz').apply(lambda x: x - x.mean())). In this case, the point of the out-of-order indexing is actually to put the array back together in its original order. In my last example above, according to the dot graph, it looks like there are four chunks right up until the end. They just have to be re-ordered. I imagine this should be cheap and simple, but I am probably overlooking something.

Case 2 seems similar to @shoyer’s example: x[np.arange(4)[::-1]]. Here we would just want to reorder the existing chunks.

If the chunk size before reindexing is not 1, then yes, one needs to do something more sophisticated. But I would argue that, if the array is being re-indexed along a dimension in which the chunk size is 1, a sensible default behavior would be to avoid aggregating into a big chunk and instead just pass the original chunks through in a new order.
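To illustrate that last point, here is a minimal sketch using dask directly (today's dask API, not how xarray or dask actually handle this case): when the chunk size along the indexed axis is 1, out-of-order indexing is just a permutation of blocks, so the chunk structure could in principle be preserved.

import numpy as np
import dask.array as da

x = da.arange(4, chunks=1)            # four chunks of size 1
order = np.array([0, 2, 1, 3])        # a 'baz'-style permutation back to the original order

# reorder the existing blocks instead of funnelling everything through one big getitem
reordered = da.concatenate([x.blocks[int(i)] for i in order])
print(reordered.chunks)               # ((1, 1, 1, 1),) -- original chunk structure preserved
print(reordered.compute())            # [0 2 1 3]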
I think that it would be useful to consider many possible cases of how people might want to chunk dask arrays with out-of-order indices, and the desired chunking outputs. XArray users like those here can provide some of those use cases. We’ll have to gather others from other communities. Maybe once we have gathered enough use cases, rules for what the correct behavior should be will emerge?
Thanks for the explanation @shoyer! Yes, that appears to be the root of the issue. After literally years of struggling with this, I am happy to finally get to this level of clarity.
Do we think dask is happy with that behavior? If not, then an upstream fix would be best. Pinging @mrocklin.
Otherwise we can try to work around it in xarray.