dask: ValueError: cannot handle a non-unique multi-index!

dask_error.txt

This might be related to Dask not supporting multi-indexes. My code was failing intermittently, which at first made me assume there was a problem in the input data. Running with the versions

dask: 1.2.2 numpy: 1.16.3 pandas: 0.24.2

the minimal example below fails. Is there a way of making this error message more intuitive, or of making this operation work?

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'ind_a': np.arange(100), 'ind_b': 1, 'var': 'whatever'})
df = dd.from_pandas(df, npartitions=90)

# Only fails when grouping by two columns.
df['nr'] = df.groupby(['ind_a', 'ind_b']).cumcount()
len(df)
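
For reference, the same operation is deterministic in plain pandas; since every (ind_a, ind_b) pair above occurs exactly once, each cumcount is 0. A pandas-only sketch of the expected result, which the Dask version should match:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ind_a': np.arange(100), 'ind_b': 1, 'var': 'whatever'})
# every (ind_a, ind_b) pair occurs exactly once, so cumcount is 0 everywhere
df['nr'] = df.groupby(['ind_a', 'ind_b']).cumcount()
print(df['nr'].unique())  # [0]
```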

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 2
  • Comments: 27 (12 by maintainers)

Most upvoted comments

Definitely still seeing this issue. As a workaround, I was able to group on a single column built by concatenating the two values being grouped on, which gets things moving and around the issue.

old code:

ddf['NEW_VALUE'] = ddf.groupby(['GROUP_KEY_1', 'GROUP_KEY_2'])['value'].cumsum()

new code to work around the issue:

ddf['GROUPER'] = ddf['GROUP_KEY_1'].astype(str) + ddf['GROUP_KEY_2'].astype(str)
ddf['NEW_VALUE'] = ddf.groupby(['GROUPER'])['value'].cumsum()
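
One caveat with this workaround: plain string concatenation can collide, silently merging distinct key pairs (e.g. '1'+'12' and '11'+'2' both become '112'). Inserting a separator keeps the keys distinct. A pandas-only sketch of the pitfall, using made-up values:

```python
import pandas as pd

pdf = pd.DataFrame({'GROUP_KEY_1': ['1', '11'],
                    'GROUP_KEY_2': ['12', '2'],
                    'value': [1, 1]})

# without a separator, both rows map to the same grouper key '112'
bad = pdf['GROUP_KEY_1'] + pdf['GROUP_KEY_2']
# with a separator, the two key pairs stay distinct
good = pdf['GROUP_KEY_1'] + '|' + pdf['GROUP_KEY_2']

print(bad.nunique(), good.nunique())  # 1 2
```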

A few years later, and this is still an error. I'm using Dask 2023.7.1 and this error happened to me. It was a pain in the ass to finally see that the behavior is stochastic: sometimes the code works and sometimes it doesn't. Even with Dask 2024.3.1 (the latest at the time of writing), this still happens.

Does someone have an update on this error? How can an issue stay open for such a long time? =/

@jakirkham

Creating a new environment with: dask: 2.20.0 numpy: 1.18.5 pandas: 1.0.5

I can confirm the error is still there. When we tested this last year, we were able to localize where the problem comes from, but did not determine the best way to develop a fix.