pyjanitor: Problems with equalities in contional_join

Brief Description

I don’t know if I’m doing this correctly but, I want to apply 2 equalities and 2 inequalities (dates). If I apply one equality it doesn’t return an error but, when I apply both equalities an error is returned.

System Information

Operating system: Windows
OS details : 10
Python version: 3.9.18

Minimally Reproducible Code

import pandas as pd
import janitor
import random

df1 = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01'
    ])
})
df1['value'] = [random.randrange(1, 50, 1) for i in range(df1.shape[0])]

df2 = pd.DataFrame({
    'index1': ['A', 'B', 'B', 'C'],
    'index2': [1, 1, 2, 1],
    'date_min': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-01', '2021-10-01']),
    'date_max': pd.to_datetime(['2024-01-30', '2024-01-30', '2024-01-30', '2021-10-30']),
})

df1.conditional_join(
    df2,         
    ('index1', 'index1', '=='), 
    ('index2', 'index2', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
)

Error Messages

AttributeError: ‘NoneType’ object has no attribute ‘size’

About this issue

Original URL
State: closed
Created 4 months ago
Comments: 18 (5 by maintainers)

Most upvoted comments

All done, @samukweku!

ericmjl on Mar 21, 2024

Thanks for the help!

By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

Basically, I have a function that completes missing dates on a DataFrame with groups. This is an extract of that code:

import pandas as pd
import janitor
import random

df = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01'
    ])
})
df['value'] = [random.randrange(1, 50, 1) for _ in range(df.shape[0])]

# Create min/max dates per group
df_group_dates = df.groupby(['index1', 'index2']).agg({'date': ['min', 'max']})
df_group_dates.columns = ["_".join(col) for col in df_group_dates.columns]
df_group_dates = df_group_dates.reset_index()

# Create a combination between all groups and all possible dates
date_min = df['date'].min()
date_max = df['date'].max()
date_range = pd.date_range(start=date_min, end=date_max, freq='D')
df_dates = pd.merge(
    df[['index1', 'index2']].drop_duplicates(),
    pd.DataFrame({'date': date_range}),
    how='cross'
)

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

# Merge with original DataFrame
df = pd.merge(df, df_dates, on=['date', 'index1', 'index2'], how='right')

As you can see when I merge df_dates with df_group_dates I create a bigger DataFrame that is later filtered. This sometimes causes a memory error when the DataFrame is big.

Now, I can replace:

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

with:

# Join combinations with min/max date ranges per group 
df_dates = df_dates.conditional_join(
    df_group_dates,         
    ('index1', 'index1', '=='),
    ('index2', 'index2', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
    how = 'inner'
).drop([('right', col) for col in ['index1', 'index2', 'date_min', 'date_max']], axis=1).droplevel(0, axis=1)

I haven’t done any test to see if this is more efficient but, at least the code is smaller and more readable.

dugarte-vox on Mar 26, 2024

@dugarte-vox a new version of pyjanitor has been released. It should be fine on pandas > 2. Test and let’s know if there are any issues. By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

samukweku on Mar 21, 2024

We haven’t migrated to pyproject.yaml yet, need your infra expertise on the PR

samukweku on Mar 19, 2024

What pandas version are you using @dugarte-vox ?

Sorry for the late response. I’m currently using pandas 2.2.1. I’ve tried with a lower version of pandas (2.0.3) as you suggested and the example is working.

dugarte-vox on Mar 19, 2024

Likely be a minor release

samukweku on Mar 19, 2024

Ok to do so! I trust your judgment on whether to do a patch or minor 😄.

ericmjl on Mar 19, 2024