dask: Automatically coerce map_blocks results to correct dimension

I have a Dask array of 120 HD images built from a stack of delayed objects:

dask.array<stack, shape=(120, 1080, 1920), dtype=uint16, chunksize=(1, 1080, 1920)>

and I am trying to apply a function that returns an int64 to each image using map_blocks:

floc_proxy = dsa.map_blocks(calc_floc_proxy, delayed_frame_array, d1, d2, n, dtype='i8', drop_axis=[1,2])
floc_proxy
dask.array<calc_floc_proxy, shape=(120,), dtype=int64, chunksize=(1,)>

When I call floc_proxy.compute(), it runs through all the calculations for each image just fine, but fails at the very end with the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-102-4aab0a1a8291> in <module>()
----> 1 floc_proxy.compute()

/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    153         dask.base.compute
    154         """
--> 155         (result,) = compute(self, traverse=False, **kwargs)
    156         return result
    157 

/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    405     results_iter = iter(results)
    406     return tuple(a if f is None else f(next(results_iter), *a)
--> 407                  for f, a in postcomputes)
    408 
    409 

/opt/conda/lib/python3.6/site-packages/dask/base.py in <genexpr>(.0)
    405     results_iter = iter(results)
    406     return tuple(a if f is None else f(next(results_iter), *a)
--> 407                  for f, a in postcomputes)
    408 
    409 

/opt/conda/lib/python3.6/site-packages/dask/array/core.py in finalize(results)
    991     while isinstance(results2, (tuple, list)):
    992         if len(results2) > 1:
--> 993             return concatenate3(results)
    994         else:
    995             results2 = results2[0]

/opt/conda/lib/python3.6/site-packages/dask/array/core.py in concatenate3(arrays)
   3355     if not ndim:
   3356         return arrays
-> 3357     chunks = chunks_from_arrays(arrays)
   3358     shape = tuple(map(sum, chunks))
   3359 

/opt/conda/lib/python3.6/site-packages/dask/array/core.py in chunks_from_arrays(arrays)
   3181 
   3182     while isinstance(arrays, (list, tuple)):
-> 3183         result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
   3184         arrays = arrays[0]
   3185         dim += 1

/opt/conda/lib/python3.6/site-packages/dask/array/core.py in <listcomp>(.0)
   3181 
   3182     while isinstance(arrays, (list, tuple)):
-> 3183         result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
   3184         arrays = arrays[0]
   3185         dim += 1

IndexError: tuple index out of range

Any ideas what I’m doing wrong? Could this be a bug?
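(Editorial note: a minimal sketch of what is likely going on, using a stand-in for the real calc_floc_proxy. With drop_axis=[1, 2] and input chunks of shape (1, 1080, 1920), map_blocks expects each block to produce a 1-d array of length 1; returning a bare int64 scalar leaves dask with nothing it can concatenate, which triggers the IndexError above. Wrapping the scalar with np.atleast_1d avoids the failure.)

```python
import numpy as np
import dask.array as da

def calc_floc_proxy(block):
    """Stand-in for the real function: reduce one (1, H, W) image
    block to a single value. Returning the bare scalar triggers the
    IndexError; a shape-(1,) array matches what map_blocks expects
    after drop_axis=[1, 2]."""
    value = block.sum(dtype="i8")   # plain int64 scalar -> fails
    return np.atleast_1d(value)     # shape (1,) array   -> works

# small stack standing in for the 120 HD images
stack = da.random.randint(0, 2**16, size=(4, 8, 8),
                          chunks=(1, 8, 8), dtype="uint16")
result = da.map_blocks(calc_floc_proxy, stack, dtype="i8",
                       drop_axis=[1, 2])
print(result.compute())  # one int64 per image
```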

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 38 (24 by maintainers)

Most upvoted comments

I imagine this working similarly to how dask dataframe uses the apply_and_enforce function in map_partitions, which serves the same role as map_blocks.

This line and the function definition a bit below are probably useful:

https://github.com/dask/dask/blob/279fdf7a6a78a1dfaa0974598aead3e1b44f9194/dask/dataframe/core.py#L3515

That would certainly be an improvement over the current state of affairs.

On Tue, Jun 12, 2018 at 5:41 AM Matthew Rocklin <notifications@github.com> wrote:

No objection to adding to the docstring or improving the error message. However depending on how this conversation goes we may also end up removing that fix immediately afterwards. Up to you if you want to expend the effort.

In this conversation people have listed a number of concerns and benefits from automatic coercion, but we are currently light on concrete proposals. I’m going to propose that for now we wrap map_blocks functions in a cleanup function that does the following:

  1. promote all inputs to the correct dimension
  2. warn if the dimension or shape is not as expected

I think that this is likely to improve user experience overall, while also pointing users towards correct behavior. @shoyer https://github.com/shoyer @jakirkham https://github.com/jakirkham how do you feel about this?

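(Editorial note: one reading of the two-step proposal quoted above, sketched as a hypothetical wrapper — apply_and_promote is an invented name, not a dask API. It coerces the block function's result up to the expected dimension and warns when the result does not match.)

```python
import warnings
import numpy as np

def apply_and_promote(func, expected_ndim, *args, **kwargs):
    """Hypothetical cleanup wrapper: run the user's block function,
    then (1) promote the result to the expected dimension and
    (2) warn if the dimension is higher than expected."""
    out = np.asarray(func(*args, **kwargs))
    if out.ndim < expected_ndim:
        # promote by prepending singleton axes (left insertion,
        # matching NumPy broadcasting conventions)
        out = out.reshape((1,) * (expected_ndim - out.ndim) + out.shape)
    elif out.ndim > expected_ndim:
        warnings.warn("block function returned ndim %d, expected %d"
                      % (out.ndim, expected_ndim))
    return out

# A scalar-returning reduction now comes back as a shape-(1,) array:
print(apply_and_promote(np.sum, 1, np.ones((2, 3))).shape)  # (1,)
```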

I think it is reasonable to require the user function to return an array with the correct dimensions based on chunks/drop/new

I think it’s a reasonable expectation to have, but I also expect to be disappointed when I have this expectation 😃

I’m not sure how it’s possible in general to expand an array to the “right number of dimensions”. I guess this could be done with broadcasting, always inserting new dimensions on the left, but this could be surprising and potentially make bugs harder to diagnose.
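(Editorial note: a two-line illustration of that ambiguity — a 1-d per-block result can be promoted to 2-d in two equally plausible ways, and broadcasting always picks the left insertion.)

```python
import numpy as np

r = np.arange(5)
left = r[np.newaxis, :]   # shape (1, 5): what broadcasting would pick
right = r[:, np.newaxis]  # shape (5, 1): an equally plausible intent
print(left.shape, right.shape)
```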

There is also the related problem of chunk sizes: map_blocks declares fixed chunk sizes up front, but the actual sizes produced at compute time might differ. So automatically expanding dimensions wouldn't work in general.

I think my preference would be for the informative error, with a compute-time check that verifies that each output has the expected shape based on chunks/drop_axis/new_axis. We could add an optional flag for disabling the check in performance critical code and tested library routines, e.g., check_output=False.
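(Editorial note: a sketch of that compute-time check under stated assumptions — checked and the check_output flag are hypothetical names from this comment, not existing dask API. The idea is to fail with an informative error at the offending block rather than with the opaque IndexError in concatenate3.)

```python
import numpy as np

def checked(func, expected_shape, check_output=True):
    """Hypothetical wrapper: verify each block result against the
    shape implied by chunks/drop_axis/new_axis, with check_output=False
    as an escape hatch for performance-critical or well-tested code."""
    def wrapper(*args, **kwargs):
        out = func(*args, **kwargs)
        if check_output and np.shape(out) != tuple(expected_shape):
            raise ValueError(
                "block function returned shape %s, expected %s"
                % (np.shape(out), tuple(expected_shape)))
        return out
    return wrapper

# A block correctly reduced to a shape-(1,) array passes the check;
# a bare scalar would raise ValueError immediately instead.
safe = checked(lambda b: np.atleast_1d(b.sum()), expected_shape=(1,))
print(safe(np.ones((1, 1080, 1920))).shape)  # (1,)
```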