pandarallel: Fails with "_wrap_applied_output() missing 1 required positional argument" where a simple pandas apply succeeds

Hello,

I’m using Python 3.8.10 (Anaconda distribution, GCC 7.5.10) on Ubuntu 20 LTS, 64-bit x86.

From my pip freeze:

pandarallel 1.5.2
pandas 1.3.0
numpy 1.20.3

I’m working with a DataFrame that looks like this one:

HoleID scaffold tpl strand base score tMean tErr modelPrediction ipdRatio coverage isboundary identificationQv context experiment isbegin_bondary isend_boundary isin_IES uniqueID No_known_IES_retention_this_CCS detailed_classif
1025444 70189477 scaffold_024_with_IES 688203 0 T 2 0.517 0.190 0.555 0.931 11 True NaN TTAAATAGAAATTAAAATCAGCTGC NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
1025446 70189477 scaffold_024_with_IES 688204 0 A 4 1.347 0.367 1.251 1.077 13 True NaN TAAATAGAAATTAAAATCAGCTGCT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
1025448 70189477 scaffold_024_with_IES 688205 0 A 5 1.913 0.779 1.464 1.307 16 True NaN AAATAGAAATTAAAATCAGCTGCTT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
1025450 70189477 scaffold_024_with_IES 688206 0 A 4 1.535 0.712 1.328 1.156 18 True NaN AATAGAAATTAAAATCAGCTGCTTA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
1025452 70189477 scaffold_024_with_IES 688207 0 A 5 1.655 0.565 1.391 1.190 18 True NaN ATAGAAATTAAAATCAGCTGCTTAA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES

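I defined the following function:

import numpy as np
import pandas as pd

def get_distance_from_nearest_criteria(df, criteria):
    # Rows of the group where the boolean column `criteria` is True
    begins = df[df[criteria]].copy()

    if len(begins) == 0:
        # No matching row in this group: every distance is undefined
        return pd.Series([np.nan for x in range(len(df))])
    else:
        list_return = []

        # For each row, distance in "tpl" to the nearest matching row
        for idx, nt in df.iterrows():
            distances = [abs(nt["tpl"] - x) for x in begins["tpl"]]
            mindistance = min(distances, default=np.nan)
            list_return.append(mindistance)

        return pd.Series(list_return)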

Then running:

from pandarallel import pandarallel
pandarallel.initialize(progress_bar=False, nb_workers=12)
out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))

leads to:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-49-02fc7c0589e3> in <module>
----> 1 out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))

~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
    463             )
    464 
--> 465             return reduce(results, reduce_meta_args)
    466 
    467         finally:

~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/data_types/dataframe_groupby.py in reduce(results, df_grouped)
     14         keys, values, mutated = zip(*results)
     15         mutated = any(mutated)
---> 16         return df_grouped._wrap_applied_output(
     17             keys, values, not_indexed_same=df_grouped.mutated or mutated
     18         )

TypeError: _wrap_applied_output() missing 1 required positional argument: 'values'

To me, the error message is not clear (I can’t tell what’s happening).

However, when I run the same code with a plain pandas apply, it works:

uniqueID           
HT2_10354935    0      297.0
                1      297.0
                2      296.0
                3      296.0
                4      295.0
                       ...  
NM9_10_9568952  502      NaN
                503      NaN
                504      NaN
                505      NaN
                506      NaN
Length: 1028437, dtype: float64

I’m running all of this in a Jupyter notebook:

ipykernel 5.3.4
ipython 7.22.0
ipython-genutils 0.2.0
notebook 6.4.0
jupyter 1.0.0
jupyter-client 6.1.12
jupyter-console 6.4.0
jupyter-core 4.7.1
jupyter-dash 0.4.0
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.0

I was wondering if someone could explain to me what’s happening, and how to fix it if the error is mine. Because it works out of the box with a plain pandas apply, I suppose that there is a small problem in pandarallel.

NB: This code also leaves unkilled processes behind, even after I interrupt or restart the IPython kernel.

EDIT: Could it be linked to the fact that I’m using a lambda function?

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 8
  • Comments: 18 (5 by maintainers)

Most upvoted comments

One workaround is to use a pandas version earlier than v1.3.0, for example v1.2.5:

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'1.2.5'
>>> from pandarallel import pandarallel
>>> pandarallel.initialize()
INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
>>> df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
>>> df
          a         b
0  0.632378  0.427258
1  0.814948  0.639748
2  0.701467  0.890010
3  0.803045  0.685235
4  0.749729  0.295159
5  0.588197  0.840467
6  0.707125  0.613361
7  0.027530  0.678850
8  0.468288  0.515698
9  0.824416  0.839627
>>> df.groupby('a').apply(lambda grp: grp)
          a         b
0  0.632378  0.427258
1  0.814948  0.639748
2  0.701467  0.890010
3  0.803045  0.685235
4  0.749729  0.295159
5  0.588197  0.840467
6  0.707125  0.613361
7  0.027530  0.678850
8  0.468288  0.515698
9  0.824416  0.839627
>>> df.groupby('a').parallel_apply(lambda grp: grp)
          a         b
0  0.632378  0.427258
1  0.814948  0.639748
2  0.701467  0.890010
3  0.803045  0.685235
4  0.749729  0.295159
5  0.588197  0.840467
6  0.707125  0.613361
7  0.027530  0.678850
8  0.468288  0.515698
9  0.824416  0.839627

From v1.3.0 onward, the _wrap_applied_output function in pandas/core/groupby/groupby.py takes an additional positional argument, data. pandarallel still calls it with the old argument list, which causes this problem.
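To make the mismatch concrete, here is a minimal sketch that does not use pandas at all. OldGroupBy and NewGroupBy are hypothetical stand-ins that only mimic the two argument lists described above (the exact pandas signatures are an assumption based on the traceback), and call_wrap is an illustrative helper, not pandarallel's actual fix.

```python
import inspect


class OldGroupBy:
    """Hypothetical stand-in for pandas < 1.3: no 'data' argument."""

    def _wrap_applied_output(self, keys, values, not_indexed_same=False):
        return ("old", keys, values)


class NewGroupBy:
    """Hypothetical stand-in for pandas >= 1.3: extra leading 'data' argument."""

    def _wrap_applied_output(self, data, keys, values, not_indexed_same=False):
        return ("new", keys, values)


# Calling the new signature the old way shifts every argument by one slot,
# so 'values' ends up missing, as in the traceback from the report.
try:
    NewGroupBy()._wrap_applied_output(["k"], ["v"], not_indexed_same=False)
except TypeError as exc:
    error_message = str(exc)


def call_wrap(grouped, keys, values, data=None):
    """Illustrative helper: pass 'data' only if the method accepts it."""
    params = inspect.signature(grouped._wrap_applied_output).parameters
    if "data" in params:
        return grouped._wrap_applied_output(
            data, keys, values, not_indexed_same=False
        )
    return grouped._wrap_applied_output(keys, values, not_indexed_same=False)
```

Inspecting the signature at call time is only one possible way to support both layouts; since _wrap_applied_output is a private pandas method, its arguments may change again in later releases.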


I have exactly the same problem with a normally defined function. I have no idea how to fix it; it seems to be a bug in pandarallel.

This is the commit where it changed: https://github.com/pandas-dev/pandas/commit/3408a61ff940900ed1aa5fb89ee92635938d2e94

So downgrading to <1.3 will “fix” it
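A small guard can flag an affected environment before parallel_apply is called. This is only a sketch: parse_version and affected_by_signature_change are hypothetical helper names, and the 1.3 threshold is taken from the comments in this thread, which report the failure with pandarallel 1.5.2 and 1.5.4.

```python
def parse_version(text):
    """Turn '1.3.0' into (1, 3, 0), dropping suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def affected_by_signature_change(pandas_version):
    """True for pandas 1.3 and later, per the reports in this thread."""
    return parse_version(pandas_version) >= (1, 3)


# Typical use (assumes pandas is installed):
#     import pandas as pd
#     if affected_by_signature_change(pd.__version__):
#         ...fall back to a plain groupby(...).apply(...)
```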

I am also having this issue.

@Kr4t0n Oh, I see. I ran into this issue right after upgrading, but in my case the conflict comes from a dependency of a core library.

This issue still exists in 1.5.4 with pandas 1.4.1.