distributed: Distributed 2021.3.1 `distributed.protocol.serialize.pickle_loads` fails with `IndexError: tuple index out of range`

What happened:

The following exception occurred with the latest version of distributed, in a test that had previously passed:

header = {'compression': (None, None), 'num-sub-frames': 2, 'serializer': 'pickle', 'split-num-sub-frames': (1, 1), ...}
frames = [<memory at 0x1209deae0>, <memory at 0x1209dea10>]
    def pickle_loads(header, frames):
        x, buffers = frames[0], frames[1:]
        writeable = header["writeable"]
        for i in range(len(buffers)):
            mv = memoryview(buffers[i])
>           if writeable[i] == mv.readonly:
E           IndexError: tuple index out of range

“writeable” is an empty tuple in the above header.

What you expected to happen:

After digging a bit and comparing runs of the same test between 2021.3.0 and 2021.3.1, I found the following:

In version 2021.3.0, the input frames always has one element, so buffers is always an empty list and the for loop containing writeable[i] never runs; writeable is an empty tuple, but it is never indexed.

In version 2021.3.1, the third time execution reaches this function, frames has two elements, so buffers is not empty and the for loop runs; writeable is still an empty tuple, hence the code fails.
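The failure mode can be reproduced in isolation. This is a minimal sketch of the check inside pickle_loads, using hypothetical stand-in values rather than real Dask frames:

```python
# Minimal sketch of the failing check in pickle_loads, with
# hypothetical stand-in values rather than real Dask frames.
writeable = ()                           # empty tuple, as in the header above
buffers = [memoryview(bytearray(b"x"))]  # non-empty after the frame split

try:
    for i in range(len(buffers)):
        mv = memoryview(buffers[i])
        # With writeable empty and buffers non-empty, this indexing fails:
        if writeable[i] == mv.readonly:
            pass
except IndexError as e:
    print(e)  # tuple index out of range
```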

I saw that there were substantial changes to distributed.protocol.core.loads, where frames is passed down in its “truncated” form (sub_frames) to the function that eventually breaks. I don’t know whether this is a newly introduced bug or whether our code needs changing. I’m not familiar with the underlying mechanisms, so I’d appreciate it if someone could take a look.
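For intuition only, here is a simplified, hypothetical sketch of what partitioning a flat frame list by per-object sub-frame counts looks like; this is not the actual distributed.protocol.core.loads code, but it mirrors the 'split-num-sub-frames': (1, 1) entry in the header above, i.e. two frames belonging to two sub-messages:

```python
# Hypothetical simplification: split a flat frame list into per-object
# groups according to sub-frame counts (not the real distributed code).
def partition(frames, split_counts):
    out, start = [], 0
    for n in split_counts:
        out.append(frames[start:start + n])
        start += n
    return out

# With split counts (1, 1), each of the two frames goes to its own group:
print(partition(["f0", "f1"], (1, 1)))  # [['f0'], ['f1']]
```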

Environment:

  • Dask version: 2021.3.1
  • Python version: 3.7.10
  • Operating System: macOS Mojave (but it also fails on Linux-based GitLab runners)
  • Install method (conda, pip, source): pip

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 43 (27 by maintainers)

Commits related to this issue

Most upvoted comments

Thank you everyone who participated in helping to track this down. I appreciate it.

Thanks @alejandrofiel, however since others can’t access the CSV files you’re using, this makes it difficult for us to debug. See https://blog.dask.org/2018/02/28/minimal-bug-reports for some information on crafting minimal bug reports

@williamBlazing I see that you’ve also reported something similar. If you or your team are able to help provide a reproducer that would be welcome.

Thanks let’s track this in issue ( https://github.com/dask/distributed/issues/4662 ). It appears we’ve addressed the original issue and one variant ( https://github.com/dask/distributed/issues/4645#issuecomment-810117759 )

Could you please test with Python 3.8?

When I run the test in a Python 3.8 environment, I get this error:

    def loads(x, *, buffers=()):
        try:
            if buffers:
>               return pickle.loads(x, buffers=buffers)
E               _pickle.UnpicklingError: pickle data was truncated
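For context, this UnpicklingError is what the stdlib raises whenever a framed pickle stream is cut short. A self-contained illustration, unrelated to Dask's actual payloads:

```python
import pickle

# Create a pickle large enough that protocol-4 framing is used,
# then cut the byte stream in half to mimic a truncated payload.
data = pickle.dumps(list(range(1000)), protocol=4)
try:
    pickle.loads(data[: len(data) // 2])
except pickle.UnpicklingError as e:
    print(e)  # pickle data was truncated
```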

Thanks for the update. What does new contain in this part of the exception?

  File "/home/cedric/venv/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 80, in pickle_loads
    return pickle.loads(x, buffers=new)

I have printed new during execution and my console showed [] before raising the error. The pickle_loads method was not called before this.
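For background on why the buffers passed to pickle.loads matter at all: with pickle protocol 5, payloads can be captured out of band at dump time and must be supplied back at load time. A self-contained stdlib sketch, not Dask code:

```python
import pickle

# Capture a payload out of band at dump time via buffer_callback...
payload = bytearray(b"x" * 64)
buffers = []
frame = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                     buffer_callback=buffers.append)
print(len(buffers))  # 1 -- one out-of-band sub-frame was captured

# ...and supply the same buffers back at load time. Passing an empty
# list here (like the `new == []` observed above) could not rebuild it.
restored = pickle.loads(frame, buffers=buffers)
print(bytes(restored) == b"x" * 64)  # True
```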

IIUC Ben was referring to using Dask + Distributed coming from main (not NumPy). The NumPy comment was in relation to the new issue Ben found. Would suggest trying Dask + Distributed from main (instead of the last tag), Cedric

@mrocklin our reproducer is here https://github.com/dask/dask/issues/7490

Hi Everyone,

Thank you for reporting this. We’ll get a fix in soon and issue a bugfix release.

However, it would be really helpful to develop a test here to ensure that this problem doesn’t recur in the future. To the extent that people are able to reduce their examples to make them more reproducible by others that would be welcome. None of the test cases in the current test suite run into this problem, and so we’re a bit blind at the moment.

For example, all of the stated examples talk about reading from S3. Does this problem occur if you’re not reading data from S3? Does it require gzip? Does it require all of the keyword parameters that you’re passing in? Does it go away if you remove a specific one of these? @Cedric-Magnan @alejandrofiel @gabicca you all are currently the best placed people to help us identify the challenge here. If you are able to reduce your problem to something that someone else can run that would be very helpful.

Given the information that you’ve provided so far I’ve tried to reproduce this issue by reading a CSV data from a public S3 dataset. This is the code that I use for that.

import dask.dataframe as dd
from dask.distributed import Client

client = Client()

df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    storage_options={"anon": True},
)
df.head()

Does this fail for you by any chance?

  • If so, great, can you share more about your environment to help us reproduce your setup. Maybe a pip freeze or conda list output. If you have the time to craft a minimal environment then even better
  • If not, then are you able to try to figure out what is different between what you’re doing in your failing example and what this example is doing? Is it the dataset? Some keyword argument?

Thanks all

Are you able to run with pdb @gabicca? It would be interesting to know what writeable is here

Hi @jakirkham , writeable is always an empty tuple, as I say in the description, hence the index error. The difference between the two executions, from what I can tell, is that in the old version frames is always a single-element list, while in the new version it sometimes has multiple elements. So in the first version the code never entered the for loop, because buffers was an empty list, while in the new version it entered the loop and failed because writeable was empty.

Thanks for raising an issue @gabicca. cc @jakirkham @madsbk

Are you able to provide a code snippet which reproduces the IndexError?

To be honest, I don’t understand the code well enough to put together a quick working example. I will spend a bit more time on it tomorrow and try to come up with one, but please don’t wait for me. I’ll let you know how it goes.