dask: `groupby` does not seem to work for non-numerics
What happened:
Following along with the bag tutorial, I played around with it to understand it better:
import dask.bag as db
b = db.from_sequence(list(range(10)))
b.groupby(lambda x: 'even' if (x % 2) == 0 else 'odd').compute()
Result:
[('odd', [3, 7]), ('even', [0, 2, 4, 6, 8]), ('odd', [1, 5, 9])]
What you expected to happen:
I expected:
[('odd', [1, 3, 5, 7, 9]), ('even', [0, 2, 4, 6, 8])]
Minimal Complete Verifiable Example:
Shown above.
Anything else we need to know?:
I tried to think of what could be causing this difference, so that I can understand what to watch out for, but could not work it out:
- Maybe it was because the return object needs to be the same (pointing to the same object, not just the same value).
  - But the string literal should be the same object. Just to be certain, I added variables `even` and `odd`, defined globally, and still got the strange result.
- Maybe it was due to partitioning.
  - First, this seems strange, since the `groupby` API specifically states that it makes the expensive traversal across all partitions.
  - Second, this is non-ideal: a major advantage is that I, as the user, should not need to worry about partition details.
  - Third, why does it work fine when the return value (the groupby key) is a number?
Environment:
- Dask version: ‘2.20.0’
- Python version: ‘3.8.5’
- Operating System: ‘Mac OS 10.15.6’
- Install method (conda, pip, source): conda, specifically using the `environment.yml` as specified in the tutorial: `conda env create -f binder/environment.yml`
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (10 by maintainers)
Commits related to this issue
- Fix #6640. — committed to itamarst/dask by itamarst 4 years ago
- Adds test_bag.py tests to assist with issue #6640. — committed to itamarst/dask by deleted user 4 years ago
- Adds test_bag.py tests to assist with issue #6640, formatted with black. — committed to itamarst/dask by deleted user 4 years ago
- Fix #6640. — committed to itamarst/dask by itamarst 4 years ago
- Fix #6640 by making sure subprocesses have a consistent hash. — committed to itamarst/dask by deleted user 4 years ago
- Make sure subprocesses have a consistent hash for bag groupby (#6660) Fixes #6640 Co-authored-by: Mike Williamson <mvw@kapacity.dk> — committed to dask/dask by itamarst 4 years ago
- Make sure subprocesses have a consistent hash for bag groupby (#6660) Fixes #6640 Co-authored-by: Mike Williamson <mvw@kapacity.dk> — committed to kumarprabhu1988/dask by itamarst 4 years ago
@jsignell Oh, now I understand: you want me to write the unit test, not the fix. OK, I’m on it. 👍
cc @itamarst due to the git bisect.
A process-dependent hash could be the problem here. I thought that we had this solved at one point.
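To illustrate what a process-dependent hash means here, the following is a minimal sketch using only the Python standard library, not dask itself. The helper name `hash_in_subprocess` is hypothetical. It starts fresh interpreters with explicit `PYTHONHASHSEED` values and compares the results: string hashes vary with the seed, while small integers hash to themselves in CPython regardless of the seed, which would be consistent with the observation that numeric keys group correctly.

```python
import os
import subprocess
import sys


def hash_in_subprocess(value_repr, seed):
    """Return hash(<value>) as computed by a fresh interpreter started
    with the given PYTHONHASHSEED (hypothetical helper for illustration)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", f"print(hash({value_repr}))"],
        env=env,
    )
    return int(out)


# Same string, two interpreters with different hash seeds.
str_hashes = [hash_in_subprocess("'odd'", seed) for seed in (1, 2)]
# Same integer, two interpreters with different hash seeds.
int_hashes = [hash_in_subprocess("1", seed) for seed in (1, 2)]
# Same string, two interpreters sharing one hash seed.
same_seed = [hash_in_subprocess("'odd'", 1) for _ in range(2)]

print("str, different seeds:", str_hashes)
print("int, different seeds:", int_hashes)
print("str, same seed:     ", same_seed)
```

When `PYTHONHASHSEED` is unset, each interpreter picks a random seed, so independently started worker processes would generally disagree on `hash('odd')`.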
We need to find a way to seed the hash consistently, or we need to find a better hashing function that is still fast and handles arbitrary Python objects.
Again I think that we may have solved this problem at one point. Grepping through logs might be informative.
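As a rough illustration of why a consistent seed matters, here is a toy model of a hash-based shuffle. It is an assumption-laden sketch, not dask's actual implementation: `toy_hash` and `shuffle_groupby` are hypothetical names, and the `salt` argument stands in for a per-process hash seed. Each input partition routes its items to an output partition by `hash(key) % n_out`; if two partitions use different salts, the same string key can land in two different output partitions and show up twice in the result.

```python
from collections import defaultdict


def toy_hash(key, salt):
    """Toy stand-in for hash(): ints ignore the salt (as small ints do in
    CPython), while strings depend on it (like a per-process hash seed)."""
    if isinstance(key, int):
        return key
    return sum(key.encode()) + salt


def shuffle_groupby(partitions, key_fn, salts, n_out=2):
    """Group items by key_fn; input partition i routes items using salts[i]."""
    outputs = [defaultdict(list) for _ in range(n_out)]
    for part, salt in zip(partitions, salts):
        for item in part:
            key = key_fn(item)
            outputs[toy_hash(key, salt) % n_out][key].append(item)
    # Groups are only merged within an output partition, never across them.
    return [(key, items) for out in outputs for key, items in out.items()]


def parity_name(x):
    return "even" if x % 2 == 0 else "odd"


partitions = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

inconsistent = shuffle_groupby(partitions, parity_name, salts=[0, 1])
consistent = shuffle_groupby(partitions, parity_name, salts=[0, 0])
numeric = shuffle_groupby(partitions, lambda x: x % 2, salts=[0, 1])

print(inconsistent)  # each string key appears twice
print(consistent)    # [('even', [0, 2, 4, 6, 8]), ('odd', [1, 3, 5, 7, 9])]
print(numeric)       # [(0, [0, 2, 4, 6, 8]), (1, [1, 3, 5, 7, 9])]
```

In this model, sharing one salt across partitions restores correct grouping, and integer keys are unaffected by the salt either way.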
Did a bisect that turned up this commit:
ae42c727f1fe75be7a7386f5014e48c4d5734004 is the first bad commit
commit ae42c727f1fe75be7a7386f5014e48c4d5734004
Author: Itamar Turner-Trauring <itamar@itamarst.org>
Date: Mon Apr 27 11:12:25 2020 -0500

dask/multiprocessing.py | 18 +++++++-----------
dask/tests/test_multiprocessing.py | 26 ++++++++++++++++---------
docs/source/scheduler-overview.rst | 3 ++-
3 files changed, 25 insertions(+), 22 deletions(-)
PR https://github.com/dask/dask/pull/4003
Thanks for opening such a well-described issue. I can reproduce this, but I don’t quite know what is going on either. Maybe @jrbourbeau has some insight.