distributed: Bug introduced in distributed 2022.5.1 -- unpack_frames

What happened:

This looks like either two related bugs or one bug that is causing two different flaky behaviors.

The first bug is that the compute call returns an array that asks for PiB of memory: https://github.com/AllenCellModeling/aicsimageio/runs/6594900687?check_suite_focus=true#step:8:137

The second bug is that unpack_frames requires a buffer of a certain length but isn’t receiving it: https://github.com/AllenCellModeling/aicsimageio/runs/6594900687?check_suite_focus=true#step:8:709

Ultimately, both bugs have the unpack_frames function in their tracebacks.
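
For readers unfamiliar with length-prefixed framing, here is a minimal sketch of how a single header mismatch could plausibly surface as both symptoms. The functions toy_pack_frames and toy_unpack_frames are hypothetical and are not distributed's implementation; the layout (a frame count, then one length per frame, then the payloads) is an assumption made only for illustration.

```python
import struct


def toy_pack_frames(frames):
    """Hypothetical packer: frame count, then each frame length, then payloads."""
    header = struct.pack("<Q", len(frames))
    header += struct.pack(f"<{len(frames)}Q", *(len(f) for f in frames))
    return header + b"".join(frames)


def toy_unpack_frames(buf):
    """Hypothetical unpacker: trusts whatever lengths the header claims."""
    buf = memoryview(buf)
    (n,) = struct.unpack_from("<Q", buf, 0)
    lengths = struct.unpack_from(f"<{n}Q", buf, 8)
    offset = 8 + 8 * n
    frames = []
    for length in lengths:
        frames.append(bytes(buf[offset:offset + length]))
        offset += length
    return frames


packed = toy_pack_frames([b"abcd", b"efgh"])
roundtrip = toy_unpack_frames(packed)

# Symptom A: the buffer is shorter than the header promises, so struct
# refuses to read the length table.
try:
    toy_unpack_frames(packed[:12])
    truncated_error = None
except struct.error as exc:
    truncated_error = str(exc)

# Symptom B: reading a length at the wrong offset pulls payload bytes into
# the integer, which then decodes as an absurdly large size.
misread_length = struct.unpack_from("<Q", packed, 20)[0]
```

In this toy version, any disagreement between the sizes written by the sender and the sizes assumed by the receiver is enough to trigger either failure, which would be consistent with both tracebacks passing through the same function.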

What you expected to happen:

This exact same code ran fine on 2022.5.0: https://github.com/AllenCellModeling/aicsimageio/runs/6563614927?check_suite_focus=true

(I also checked the underlying file reading library version tifffile and it stays the same 2022.5.4 for both the PR CI and the post merge CI.)

Minimal Complete Verifiable Example:

Unlike the last bug I caught via CI (#6255), this one seems deep in the weeds of worker magic, but I do think I have a culprit from the changelog.

  • https://github.com/dask/distributed/pull/6333 – the actual code for host_array in utils was changed in 2022.5.0 but not in 2022.5.1. However, a change made in that PR ensures a memoryview for each frame to unpack. Memoryview -> incorrect allocation size / buffer size seems like a likely culprit?
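
To make the memoryview hypothesis concrete, here is a small sketch of one way a memoryview can report a size that differs from its byte count. This is only an illustration of the general pitfall, not a confirmed diagnosis of what PR 6333 changed; the variable names are made up for the example.

```python
import array

# Four 8-byte floats: 4 items, 32 bytes of underlying memory.
data = array.array("d", [1.0, 2.0, 3.0, 4.0])
view = memoryview(data)

element_count = len(view)   # counts items along the first dimension
byte_count = view.nbytes    # counts bytes of underlying memory

# Casting to unsigned bytes gives a 1-D view where len() and nbytes agree.
flat = view.cast("B")
flat_length = len(flat)
```

If one side of a framing protocol records len(view) while the other side allocates or slices by bytes, the two sizes disagree by a factor of the item size, which is the kind of mismatch that could produce a wrong allocation or a short buffer.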

Anything else we need to know?:

Environment:

  • Dask version: 2022.5.1
  • Python version: CI tests on 3.8, 3.9, and 3.10, all are failing
  • Operating System: CI tests on Windows, MacOS, and Ubuntu – MacOS tests fail fast; the other tests are taking 3 hours, so I am assuming something is going wrong with them too.
  • Install method (conda, pip, source): pip
Cluster Dump State:

About this issue

  • Original URL: https://github.com/dask/distributed/issues/6448
  • State: closed
  • Created 2 years ago
  • Comments: 29 (20 by maintainers)

Most upvoted comments

Thanks @JacksonMaxfield – I can reproduce with main and I’m bisecting now

We have a working theory now. Trying to get it down to a reproducer

Thanks Jackson!

aicsimageio now has a workflow to test against upstream main. Hopefully I can spot these issues prior to release next time. Thanks again everyone ❤️

Figured out a test that covered the case and included it in the PR. Should be good to go

All test-upstreams checks pass on jakirkham’s branch 👍

Crossposting: @jakirkham’s branch so far seems to fix the problem. Tests here: https://github.com/AllenCellModeling/aicsimageio/runs/6600771089?check_suite_focus=true

(not all tests are important in this case, just the test-upstreams)

I have converted https://github.com/AllenCellModeling/aicsimageio/pull/406 into a PR to add an upstream testing job to aicsimageio.

For provenance, here are the PR checks from when it was testing mrocklin’s reversion branch: https://github.com/AllenCellModeling/aicsimageio/runs/6599766163?check_suite_focus=true

🟢

Pair debugging with Ben