dask: Questions about map_blocks and _meta

This is a general question but is more focused on lower-level dask development so I figured I’d ask it here. This has to do with the recent _meta changes from #2977, #4543, and others from @mrocklin @shoyer and @pentschev.

Starting with dask 2.0, some of my tests on my Satpy library are failing because I was checking how often my map_blocks’d function was being called. To determine the meta information, the function is now being called once for meta and once for actual usage from what I can tell.

So my main question is, is it possible to provide keyword arguments to map_blocks to say “this is the array type that will be returned by this function” or could it be changed to default to the type of the input arrays? Is there some way to skip this meta generation step? Would this interfere with the sparse/masked array support?

Relevant source code: https://github.com/dask/dask/blob/master/dask/array/utils.py#L118-L133

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Yeah I think that is the origin of .blocks. There are various use cases that benefit from grabbing individual chunks and working on them.

That said, with the addition of HLG, there is some benefit to rewriting things in terms of map_blocks instead.

IMO, we should add an argument to allow passing a user-defined meta, but the current default behavior should be kept.

We should definitely have function arguments to allow explicitly providing meta in any location where it would be inferred from a user-defined function. This is similar to the existing situation with dtype. Running user provided functions on bogus inputs to determine metadata should really be a last resort.

@pentschev could you kindly remind why it’s important to keep track of _meta at all on dask arrays? Which dask operations need to know the types of contained arrays?