distributed: Distributed 2021.3.1 `distributed.protocol.serialize.pickle_loads` fails with `IndexError: tuple index out of range`
What happened:
The following exception occurred with the latest version of distributed, in a test that had previously passed:
```
header = {'compression': (None, None), 'num-sub-frames': 2, 'serializer': 'pickle', 'split-num-sub-frames': (1, 1), ...}
frames = [<memory at 0x1209deae0>, <memory at 0x1209dea10>]

    def pickle_loads(header, frames):
        x, buffers = frames[0], frames[1:]
        writeable = header["writeable"]
        for i in range(len(buffers)):
            mv = memoryview(buffers[i])
>           if writeable[i] == mv.readonly:
E           IndexError: tuple index out of range
```
`writeable` is an empty tuple in the above header.
What you expected to happen:
After digging a bit and comparing runs of the same test between 2021.3.0 and 2021.3.1, I found the following:

In version 2021.3.0, the input `frames` always has one element, so `buffers` is always an empty list and the for loop containing `writeable[i]` never runs; `writeable` is always an empty tuple.

In version 2021.3.1, the third time execution reaches this function, `frames` has 2 elements, so `buffers` is not empty and the for loop runs; `writeable` is still an empty tuple, hence the code fails.
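A minimal sketch of the failure mode, mirroring the `pickle_loads` logic from the traceback above (the header and frame values here are illustrative, not the actual wire data):

```python
# Sketch of the failure mode seen in the traceback. Values are illustrative.
header = {"writeable": ()}  # empty tuple, as observed in the failing run
frames = [memoryview(b"pickled-object"), memoryview(b"extra-buffer")]  # two frames, as in 2021.3.1

x, buffers = frames[0], frames[1:]  # buffers is now non-empty, so the loop runs
writeable = header["writeable"]
for i in range(len(buffers)):
    mv = memoryview(buffers[i])
    if writeable[i] == mv.readonly:  # writeable[0] -> IndexError: tuple index out of range
        pass
```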
I saw that there were substantial changes to `distributed.protocol.core.loads`, where `frames` is passed down in its “truncated” form (`sub_frames`) to the function that eventually breaks. I don’t know whether this is a newly introduced bug or whether our code needs changing. I’m not familiar with the underlying mechanisms, so I’d appreciate it if someone could take a look.
Environment:
- Dask version: 2021.3.1
- Python version: 3.7.10
- Operating System: macOS Mojave (but it also fails on Linux-based GitLab runners)
- Install method (conda, pip, source): pip
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 43 (27 by maintainers)
Commits related to this issue
- Update `dask` + `distributed` to `2021.4.0` (#7858) Needed to pick up some serialization bug fixes in the recent Distributed release ( https://github.com/dask/distributed/issues/4645 ) ( https://gith... — committed to rapidsai/cudf by jakirkham 3 years ago
Thank you to everyone who participated in helping to track this down. I appreciate it.
Thanks @alejandrofiel; however, since others can’t access the CSV files you’re using, this is difficult for us to debug. See https://blog.dask.org/2018/02/28/minimal-bug-reports for some information on crafting minimal bug reports.
@williamBlazing I see that you’ve also reported something similar. If you or your team are able to help provide a reproducer that would be welcome.
Thanks let’s track this in issue ( https://github.com/dask/distributed/issues/4662 ). It appears we’ve addressed the original issue and one variant ( https://github.com/dask/distributed/issues/4645#issuecomment-810117759 )
When I run the test in a Python 3.8 environment, I get this error:
I have printed `new` during execution and my console showed `[]` before raising the error. The `pickle_loads` method was not called before this.

IIUC Ben was referring to using Dask + Distributed coming from `main` (not NumPy). The NumPy comment was in relation to the new issue Ben found. Would suggest trying Dask + Distributed from `main` (instead of the last tag), Cedric.

@mrocklin our reproducer is here https://github.com/dask/dask/issues/7490
Hi Everyone,
Thank you for reporting this. We’ll get a fix in soon and issue a bugfix release.
However, it would be really helpful to develop a test here to ensure that this problem doesn’t recur in the future. To the extent that people are able to reduce their examples to make them more reproducible by others, that would be welcome. None of the test cases in the current test suite run into this problem, so we’re a bit blind at the moment.
For example, all of the reported examples involve reading from S3. Does this problem occur if you’re not reading data from S3? Does it require gzip? Does it require all of the keyword parameters that you’re passing in? Does it go away if you remove a specific one of them? @Cedric-Magnan @alejandrofiel @gabicca, you are currently the best-placed people to help us identify the problem here. If you are able to reduce your problem to something that someone else can run, that would be very helpful.
Given the information that you’ve provided so far, I’ve tried to reproduce this issue by reading CSV data from a public S3 dataset.
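For example, something along these lines (the S3 path and storage options here are illustrative assumptions, not necessarily the exact snippet):

```python
# Sketch of a reproducer: read CSV data from a public S3 dataset through a
# local distributed cluster. The S3 path and options are illustrative.
# Requires s3fs to be installed for S3 access.
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster with default settings

df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",  # hypothetical public dataset
    storage_options={"anon": True},  # anonymous access to the public bucket
)
print(df.head())  # pulls results back to the client, exercising the serialization path
```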
Does this fail for you by any chance?
Thanks all
Hi @jakirkham ,
`writeable` is always an empty tuple, as I say in the description, hence the index error. The difference between the two executions, from what I can tell, is that in the old version `frames` is always a single-element list, while in the new version it sometimes has multiple elements. So in the first version the code never entered the for loop, because `buffers` was an empty list, whereas in the new version it entered the loop and failed with `writeable` being empty.

I don’t really understand the code well enough, to be honest, to put together a quick working example. But I will spend a bit more time on it tomorrow and try to come up with one. Please don’t wait for me with this, though. I’ll let you know how it goes.