dask: Sparsely blocked/chunked arrays
I have an application in which I manually assemble a dask array from blocks, because each chunk in the final array needs to go through a specific transform to ensure correct alignment. Furthermore, not all chunks in the assembled array contain useful information; some are only needed to preserve spatial alignment with their neighboring chunks. The result is a dask array with sparse blocks/chunks.
Currently I am using masked arrays to mask out entire chunks. However, this is inefficient, because those blocks are still included in the task graph and are still evaluated to some degree. Ultimately it would be nice to have the concept of chunk/block masks, which would let dask know that any operation on such a chunk will yield a NaN or empty result, so the chunk can be skipped entirely without additional computation.
The application generally has more empty blocks/chunks than blocks with valid data, so the task graph grows very large with many unneeded operations; having a way to prevent this would be nice.
A small example of how I currently handle this:
```python
import numpy as np
import dask.array as da

arr0 = da.from_array(np.arange(1, 26).reshape(5, 5), chunks=(5, 5))
arr1 = da.from_array(np.arange(25, 50).reshape(5, 5), chunks=(5, 5))
# This chunk is empty/blank/invalid but is needed to align the other chunks
arr2 = da.from_array(np.full((5, 5), np.nan), chunks=(5, 5))
arr3 = da.from_array(np.arange(75, 100).reshape(5, 5), chunks=(5, 5))

a = da.block([[arr0, arr1], [arr2, arr3]])
b = da.ma.masked_invalid(a)
c = b.min().compute()
```
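For what it's worth, the masked chunk's participation can be verified by counting tasks in the graph. A quick sketch building on the example above (`b` is the masked array defined there):

```python
# The all-NaN chunk still contributes tasks: masking changes the
# values flowing through the graph, not the graph's structure.
n_tasks = len(dict(b.min().__dask_graph__()))
print(n_tasks)  # includes tasks for the masked-out chunk
```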
What would be nice to have:
```python
import numpy as np
import dask.array as da

arr0 = da.from_array(np.arange(1, 26).reshape(5, 5), chunks=(5, 5))
arr1 = da.from_array(np.arange(25, 50).reshape(5, 5), chunks=(5, 5))
arr3 = da.from_array(np.arange(75, 100).reshape(5, 5), chunks=(5, 5))

a = da.sparse_block([[arr0, arr1], [None, arr3]])  # hypothetical API
b = a.min().compute()
```
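In the meantime, a rough approximation one could write today: a hypothetical helper (`sparse_block_workaround`, not a dask API) that fills the `None` slots with lazy NaN blocks before calling `da.block`. Note this only removes boilerplate; the filler chunks still produce tasks, which is exactly what the proposal aims to avoid:

```python
import numpy as np
import dask.array as da

def sparse_block_workaround(blocks, shape=(5, 5), fill=np.nan):
    # Replace None placeholders with lazy constant blocks so that
    # da.block can still assemble a rectangular array.
    # Assumes all blocks share the same shape.
    filled = [
        [blk if blk is not None else da.full(shape, fill) for blk in row]
        for row in blocks
    ]
    return da.block(filled)

a = sparse_block_workaround([[arr0, arr1], [None, arr3]])
c = da.ma.masked_invalid(a).min().compute()
```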
About this issue
- State: open
- Created 3 years ago
- Reactions: 2
- Comments: 15 (11 by maintainers)
I’d suggest `is_nodata` or something along those lines to abstract it further away from any specific value (for instance, in my application, these values are often `nan`). To avoid any impact to dense arrays, perhaps we want to subclass this into a `sparse_array` so that there is no performance hit for standard arrays (which I’d assume are a large percentage of use cases).

I am also registering my interest in such a feature! I have been trying both the `sparse` library (per this reference) and `da.masked_array`, but it does not appear that either reduces computation in the case of very sparse array operations (even the example on the array-sparse page above runs slower using `sparse`).

As for `is_zero` vs `is_nodata`, I am not sure these are entirely equivalent. In the case of element-wise multiplication, for instance, optimizing over an entire block of zeros is perfectly legitimate, and the result is mathematically zero; if that block is instead all `nan`, the desired result might be `nan` instead of zero, so I’m not sure it should be generalized from zero to `nan`.

I don’t think they’re very similar. Right now https://github.com/dask/dask/pull/7655 is targeted only at low-level slicing, not fancy indexing or masking. I’ll see what Ian & Gabe think about it when I talk to them next.
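For context on the `sparse` route mentioned above, the usual pattern from the dask documentation is to convert chunks with `map_blocks`; this changes the in-chunk representation, but every chunk still gets a task, which is presumably why no speedup was observed:

```python
import sparse
import dask.array as da

x = da.random.random((1000, 1000), chunks=(100, 100))
x[x < 0.95] = 0               # make the array mostly zeros
s = x.map_blocks(sparse.COO)  # sparse chunks, same task structure
total = s.sum().compute()
```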
Perhaps what’s needed here is better culling of tasks for masked arrays?
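For reference, dask's existing culling removes tasks that are unreachable from the requested outputs; it does not drop tasks whose inputs are known to be masked or empty. A minimal illustration with a hand-built graph:

```python
from dask.optimization import cull

# 'unused' is not needed to compute 'out', so cull drops it.
dsk = {
    "a": 1,
    "b": 2,
    "out": (sum, ["a", "b"]),
    "unused": (sum, ["a", "a"]),
}
culled, dependencies = cull(dsk, keys=["out"])
assert "unused" not in culled
```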
Oh, I missed that you are talking about zero-filled chunks. You are right. I think the term `empty` threw me, because in dask.dataframe that refers to a dataframe with no rows.

This is a small example, but with some awareness of emptiness in the graph, only 3 tasks would be needed.
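The graph the commenter refers to is not reproduced here, but a similar rendering can be produced with dask's built-in visualizer (requires the `graphviz` package; the filename is arbitrary):

```python
# Render the task graph of the masked reduction from the first
# example; tasks for all four chunks appear, including the NaN-only
# one that emptiness-awareness could skip.
b.min().visualize(filename="masked-min-graph.png")
```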
I would like to express interest in this feature also. To give more context, I am interested in performing some operations on large sparse arrays; however, given the scale of the matrices, the number of tasks grows excessively, even though most of those tasks are useless (e.g. adding a block of zeros to another block of zeros).
One strategy is removing tasks from the `.dask` graph based on the chunks/blocks that are known to contain zeros, but there might be a better way of dealing with this and avoiding the creation of tasks that are known to be unnecessary.

Do you think it's feasible to add a block-level boolean attribute that designates whether a block is `empty` (or all zeros), e.g. `self.block_is_empty` (True/False/None for empty, non-empty, unknown)? The status of each block could then be used to reduce tasks or pre-compute them (e.g. element-wise multiplication of an empty block with a non-empty block is an empty block). If there is interest in this, I can work on a PR with some guidance.
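To make the propagation idea concrete, here is a minimal sketch of such rules, using the three-state flag proposed above (`block_is_empty` is hypothetical, not an existing dask attribute):

```python
def mul_block_is_empty(left, right):
    # True: known all-zero, False: known non-empty, None: unknown.
    if left is True or right is True:
        return True  # zeros * anything is zeros -> task can be skipped
    # Two non-empty blocks can still multiply to all zeros
    # element-wise, so anything else stays unknown.
    return None

def add_block_is_empty(left, right):
    # zeros + X == X, so the result inherits the other side's status.
    if left is True:
        return right
    if right is True:
        return left
    return None  # non-empty blocks may still cancel to all zeros
```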