dask: Sparsely blocked/chunked arrays
I have an application in which I manually assemble a dask array from blocks, because each chunk in the final array needs to go through a specific transform to ensure correct alignment. Furthermore, not all chunks in the assembled array contain useful information; some are only needed to preserve spatial alignment with their neighboring chunks. The result is a dask array with sparse blocks/chunks.
Currently I am using masked arrays to mask out entire chunks. However, this is inefficient, because those blocks are still included in the task graph and are still evaluated to some degree. Ultimately it would be nice to have the concept of chunk/block masks, which would let dask know that any operation on such a chunk will yield a NaN or empty result, so the chunk can be skipped entirely without additional computation.
The application generally has more empty blocks/chunks than blocks with valid data, so the task graph grows very large with many unneeded operations; having a way to prevent this would be nice.
A small example of how I currently handle this:
```python
import numpy as np
import dask.array as da

arr0 = da.from_array(np.arange(1, 26).reshape(5, 5), chunks=(5, 5))
arr1 = da.from_array(np.arange(25, 50).reshape(5, 5), chunks=(5, 5))
# This chunk is empty/blank/invalid but is needed to align the other chunks
arr2 = da.from_array(np.full((5, 5), np.nan), chunks=(5, 5))
arr3 = da.from_array(np.arange(75, 100).reshape(5, 5), chunks=(5, 5))

a = da.block([[arr0, arr1], [arr2, arr3]])
b = da.ma.masked_invalid(a)
c = b.min().compute()
```
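For what it's worth, the masked chunk's participation can be verified by counting tasks in the graph. A quick sketch building on the example above (`b` is the masked array defined there):

```python
# The all-NaN chunk still contributes tasks: masking changes the
# values flowing through the graph, not the graph's structure.
n_tasks = len(dict(b.min().__dask_graph__()))
print(n_tasks)  # includes tasks for the masked-out chunk
```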
What would be nice to have:
```python
import numpy as np
import dask.array as da

arr0 = da.from_array(np.arange(1, 26).reshape(5, 5), chunks=(5, 5))
arr1 = da.from_array(np.arange(25, 50).reshape(5, 5), chunks=(5, 5))
arr3 = da.from_array(np.arange(75, 100).reshape(5, 5), chunks=(5, 5))

a = da.sparse_block([[arr0, arr1], [None, arr3]])  # hypothetical API
b = a.min().compute()
```
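In the meantime, a rough approximation one could write today: a hypothetical helper (`sparse_block_workaround`, not a dask API) that fills the `None` slots with lazy NaN blocks before calling `da.block`. Note this only removes boilerplate; the filler chunks still produce tasks, which is exactly what the proposal aims to avoid:

```python
import numpy as np
import dask.array as da

def sparse_block_workaround(blocks, shape=(5, 5), fill=np.nan):
    # Replace None placeholders with lazy constant blocks so that
    # da.block can still assemble a rectangular array.
    # Assumes all blocks share the same shape.
    filled = [
        [blk if blk is not None else da.full(shape, fill) for blk in row]
        for row in blocks
    ]
    return da.block(filled)

a = sparse_block_workaround([[arr0, arr1], [None, arr3]])
c = da.ma.masked_invalid(a).min().compute()
```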
About this issue
- State: open
- Created 3 years ago
- Reactions: 2
- Comments: 15 (11 by maintainers)
I’d suggest `is_nodata` or something along those lines to abstract it further away from any specific value (for instance, in my application, these values are often `nan`). To avoid any impact to dense arrays, perhaps we want to subclass this into a `sparse_array` so that there is no performance hit for standard arrays (which I’d assume are a large percentage of use cases).

I am also registering my interest in such a feature! I have been trying both the `sparse` library (per this reference) and `da.masked_array`, but it does not appear that either reduces computation in the case of very sparse array operations (even the example on the array-sparse page above runs slower using `sparse`).

As for `is_zero` vs `is_nodata`, I am not sure these are entirely equivalent. In the case of element-wise multiplication, for instance, optimizing over an entire block of zeros is perfectly legitimate, and the result is mathematically zero; if that block is instead all `nan`, the desired result might be `nan` instead of zero, so I’m not sure it should be generalized from zero to `nan`.

I don’t think they’re very similar. Right now https://github.com/dask/dask/pull/7655 is targeted only at low-level slicing, not fancy indexing or masking. I’ll see what Ian & Gabe think about it when I talk to them next.
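For context on the `sparse` route mentioned above, the usual pattern from the dask documentation is to convert chunks with `map_blocks`; this changes the in-chunk representation, but every chunk still gets a task, which is presumably why no speedup was observed:

```python
import sparse
import dask.array as da

x = da.random.random((1000, 1000), chunks=(100, 100))
x[x < 0.95] = 0               # make the array mostly zeros
s = x.map_blocks(sparse.COO)  # sparse chunks, same task structure
total = s.sum().compute()
```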
Perhaps what’s needed here is better culling of tasks for masked arrays?
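For reference, dask's existing culling removes tasks that are unreachable from the requested outputs; it does not drop tasks whose inputs are known to be masked or empty. A minimal illustration with a hand-built graph:

```python
from dask.optimization import cull

# 'unused' is not needed to compute 'out', so cull drops it.
dsk = {
    "a": 1,
    "b": 2,
    "out": (sum, ["a", "b"]),
    "unused": (sum, ["a", "a"]),
}
culled, dependencies = cull(dsk, keys=["out"])
assert "unused" not in culled
```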
Oh, I missed that you are talking about zero-filled chunks. You are right. I think the term `empty` threw me, because in dask.dataframe that refers to a dataframe with no rows.

This is a small example, but with some awareness of emptiness in the graph, only 3 tasks would be needed.
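The graph the commenter refers to is not reproduced here, but a similar rendering can be produced with dask's built-in visualizer (requires the `graphviz` package; the filename is arbitrary):

```python
# Render the task graph of the masked reduction from the first
# example; tasks for all four chunks appear, including the NaN-only
# one that emptiness-awareness could skip.
b.min().visualize(filename="masked-min-graph.png")
```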
I would like to express interest in this feature also. To give more context, I am interested in performing some operations on large sparse arrays; however, given the scale of the matrices, the number of tasks grows excessively, even though most of those tasks are useless (e.g. adding a block of zeros to another block of zeros).
One strategy is removing tasks from the `.dask` graph based on the chunks/blocks that are known to contain zeros, but there might be a better way of dealing with this and avoiding the creation of tasks that are known to be unnecessary.

Do you think it's feasible to add a block-level boolean attribute that designates whether a block is `empty` (or all zeros), e.g. `self.block_is_empty` (True/False/None for empty, non-empty, unknown)? The status of each block could then be used to reduce tasks or pre-compute them (e.g. element-wise multiplication of an empty block with a non-empty block is an empty block). If there is interest in this, I can work on a PR with some guidance.
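To make the propagation idea concrete, here is a minimal sketch of such rules, using the three-state flag proposed above (`block_is_empty` is hypothetical, not an existing dask attribute):

```python
def mul_block_is_empty(left, right):
    # True: known all-zero, False: known non-empty, None: unknown.
    if left is True or right is True:
        return True  # zeros * anything is zeros -> task can be skipped
    # Two non-empty blocks can still multiply to all zeros
    # element-wise, so anything else stays unknown.
    return None

def add_block_is_empty(left, right):
    # zeros + X == X, so the result inherits the other side's status.
    if left is True:
        return right
    if right is True:
        return left
    return None  # non-empty blocks may still cancel to all zeros
```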