dask: Automatically coerce map_blocks results to the correct dimension
I have a Dask array of 120 HD images built from a stack of delayed objects:
dask.array<stack, shape=(120, 1080, 1920), dtype=uint16, chunksize=(1, 1080, 1920)>
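For context, here is one way a stack like this can be built (a hedged sketch; load_frame is a hypothetical stand-in for the real per-frame loader):

```python
import numpy as np
import dask
import dask.array as dsa

@dask.delayed
def load_frame(i):
    # Hypothetical loader standing in for the real per-image reader.
    return np.zeros((1080, 1920), dtype=np.uint16)

frames = [dsa.from_delayed(load_frame(i), shape=(1080, 1920), dtype=np.uint16)
          for i in range(120)]
delayed_frame_array = dsa.stack(frames)
# dask.array<stack, shape=(120, 1080, 1920), dtype=uint16, chunksize=(1, 1080, 1920)>
```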
I am trying to apply a function, which returns a single int64 per image, to each image using map_blocks:
floc_proxy = dsa.map_blocks(calc_floc_proxy, delayed_frame_array, d1, d2, n, dtype='i8', drop_axis=[1,2])
floc_proxy
dask.array<calc_floc_proxy, shape=(120,), dtype=int64, chunksize=(1,)>
When I call floc_proxy.compute(), it runs through all the calculations for each image just fine, but fails at the very end with the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-102-4aab0a1a8291> in <module>()
----> 1 floc_proxy.compute()
/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
153 dask.base.compute
154 """
--> 155 (result,) = compute(self, traverse=False, **kwargs)
156 return result
157
/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
405 results_iter = iter(results)
406 return tuple(a if f is None else f(next(results_iter), *a)
--> 407 for f, a in postcomputes)
408
409
/opt/conda/lib/python3.6/site-packages/dask/base.py in <genexpr>(.0)
405 results_iter = iter(results)
406 return tuple(a if f is None else f(next(results_iter), *a)
--> 407 for f, a in postcomputes)
408
409
/opt/conda/lib/python3.6/site-packages/dask/array/core.py in finalize(results)
991 while isinstance(results2, (tuple, list)):
992 if len(results2) > 1:
--> 993 return concatenate3(results)
994 else:
995 results2 = results2[0]
/opt/conda/lib/python3.6/site-packages/dask/array/core.py in concatenate3(arrays)
3355 if not ndim:
3356 return arrays
-> 3357 chunks = chunks_from_arrays(arrays)
3358 shape = tuple(map(sum, chunks))
3359
/opt/conda/lib/python3.6/site-packages/dask/array/core.py in chunks_from_arrays(arrays)
3181
3182 while isinstance(arrays, (list, tuple)):
-> 3183 result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
3184 arrays = arrays[0]
3185 dim += 1
/opt/conda/lib/python3.6/site-packages/dask/array/core.py in <listcomp>(.0)
3181
3182 while isinstance(arrays, (list, tuple)):
-> 3183 result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
3184 arrays = arrays[0]
3185 dim += 1
IndexError: tuple index out of range
Any ideas what I’m doing wrong? Could this be a bug?
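The traceback points at the final concatenation step, which is consistent with each block returning a 0-d scalar where drop_axis=[1, 2] leads dask to expect a 1-d block of shape (1,). A minimal sketch of the failure and the usual workaround, with a placeholder computation standing in for calc_floc_proxy:

```python
import numpy as np
import dask.array as dsa

stack = dsa.zeros((120, 1080, 1920), dtype=np.uint16, chunks=(1, 1080, 1920))

def bad_proxy(block, d1, d2, n):
    return np.int64(block.sum())  # 0-d scalar: shape () instead of (1,)

out = dsa.map_blocks(bad_proxy, stack, 1, 2, 3, dtype='i8', drop_axis=[1, 2])
# out.compute() fails on dask of this era with
# IndexError: tuple index out of range

def good_proxy(block, d1, d2, n):
    # Wrapping the result in a length-1 array matches the expected (1,) block.
    return np.array([block.sum()], dtype='i8')

out = dsa.map_blocks(good_proxy, stack, 1, 2, 3, dtype='i8', drop_axis=[1, 2])
out.compute()  # works: an int64 array of shape (120,)
```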
I imagine this being similar to how dask dataframe handles the apply_and_enforce function in map_partitions, which serves the same role as map_blocks. This line and the function definition a bit below are probably useful:
https://github.com/dask/dask/blob/279fdf7a6a78a1dfaa0974598aead3e1b44f9194/dask/dataframe/core.py#L3515
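Roughly, such a wrapper for map_blocks might look like this (a hypothetical sketch, not dask's actual code; the name apply_and_check and the error message are made up):

```python
import numpy as np

def apply_and_check(func, expected_ndim, *args, **kwargs):
    # Run the user's block function, then verify the result has the
    # dimensionality that map_blocks promised via drop_axis/new_axis.
    out = np.asarray(func(*args, **kwargs))  # coerce scalars/lists to ndarrays
    if out.ndim != expected_ndim:
        raise ValueError(
            "map_blocks function returned a %d-d result, but a %d-d block "
            "was expected from the array's chunks, drop_axis and new_axis"
            % (out.ndim, expected_ndim)
        )
    return out
```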
That would certainly be an improvement over the current state of affairs.
I think it’s a reasonable expectation to have, but I also expect to be disappointed when I have this expectation 😃
I’m not sure how it’s possible in general to expand an array to the “right number of dimensions”. I guess this could be done with broadcasting, always inserting new dimensions on the left, but this could be surprising and potentially make bugs harder to diagnose.
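For instance, with NumPy semantics a scalar block could be silently left-expanded to whatever shape the graph expects, which is exactly the kind of coercion that can mask a genuinely wrong return value:

```python
import numpy as np

scalar = np.int64(7)                          # shape (): what the function returned
block = np.broadcast_to(scalar, (1,))         # shape (1,): what drop_axis=[1, 2] expects
block3d = np.broadcast_to(scalar, (1, 1, 1))  # left-padded to 3-d just as silently
```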
There is also the related problem of chunk sizes: map_blocks creates an array with fixed chunk sizes, which might change when the array is evaluated. So automatically expanding dimensions wouldn't work in general.

I think my preference would be for the informative error, with a compute-time check that verifies that each output has the expected shape based on chunks/drop_axis/new_axis. We could add an optional flag for disabling the check in performance-critical code and tested library routines, e.g. check_output=False.
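A hedged sketch of what that compute-time check and opt-out could look like (check_output is the proposed flag, not an existing map_blocks keyword, and check_block is a hypothetical helper):

```python
import numpy as np

def check_block(out, expected_shape):
    # Hypothetical check run on each materialized block at compute time.
    out = np.asarray(out)
    if out.shape != expected_shape:
        raise ValueError(
            "map_blocks produced a block of shape %s, expected %s; did the "
            "function return a scalar where an array was required?"
            % (out.shape, expected_shape)
        )
    return out

# Proposed usage: check on by default, disabled in performance-critical code.
# floc_proxy = dsa.map_blocks(calc_floc_proxy, delayed_frame_array, d1, d2, n,
#                             dtype='i8', drop_axis=[1, 2], check_output=False)
```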