pyjanitor: Problems with equalities in contional_join

Brief Description

I don’t know if I’m doing this correctly but, I want to apply 2 equalities and 2 inequalities (dates). If I apply one equality it doesn’t return an error but, when I apply both equalities an error is returned.

System Information

  • Operating system: Windows
  • OS details : 10
  • Python version: 3.9.18

Minimally Reproducible Code

import pandas as pd
import janitor
import random

df1 = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01'
    ])
})
df1['value'] = [random.randrange(1, 50, 1) for i in range(df1.shape[0])]

df2 = pd.DataFrame({
    'index1': ['A', 'B', 'B', 'C'],
    'index2': [1, 1, 2, 1],
    'date_min': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-01', '2021-10-01']),
    'date_max': pd.to_datetime(['2024-01-30', '2024-01-30', '2024-01-30', '2021-10-30']),
})

df1.conditional_join(
    df2,         
    ('index1', 'index1', '=='), 
    ('index2', 'index2', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
)

Error Messages

AttributeError: ‘NoneType’ object has no attribute ‘size’

About this issue

  • Original URL
  • State: closed
  • Created 4 months ago
  • Comments: 18 (5 by maintainers)

Most upvoted comments

All done, @samukweku!

Thanks for the help!

By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

Basically, I have a function that completes missing dates on a DataFrame with groups. This is an extract of that code:

import pandas as pd
import janitor
import random

df = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01'
    ])
})
df['value'] = [random.randrange(1, 50, 1) for _ in range(df.shape[0])]

# Create min/max dates per group
df_group_dates = df.groupby(['index1', 'index2']).agg({'date': ['min', 'max']})
df_group_dates.columns = ["_".join(col) for col in df_group_dates.columns]
df_group_dates = df_group_dates.reset_index()

# Create a combination between all groups and all possible dates
date_min = df['date'].min()
date_max = df['date'].max()
date_range = pd.date_range(start=date_min, end=date_max, freq='D')
df_dates = pd.merge(
    df[['index1', 'index2']].drop_duplicates(),
    pd.DataFrame({'date': date_range}),
    how='cross'
)

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

# Merge with original DataFrame
df = pd.merge(df, df_dates, on=['date', 'index1', 'index2'], how='right')

As you can see when I merge df_dates with df_group_dates I create a bigger DataFrame that is later filtered. This sometimes causes a memory error when the DataFrame is big.

Now, I can replace:

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

with:

# Join combinations with min/max date ranges per group 
df_dates = df_dates.conditional_join(
    df_group_dates,         
    ('index1', 'index1', '=='),
    ('index2', 'index2', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
    how = 'inner'
).drop([('right', col) for col in ['index1', 'index2', 'date_min', 'date_max']], axis=1).droplevel(0, axis=1)

I haven’t done any test to see if this is more efficient but, at least the code is smaller and more readable.

@dugarte-vox a new version of pyjanitor has been released. It should be fine on pandas > 2. Test and let’s know if there are any issues. By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

We haven’t migrated to pyproject.yaml yet, need your infra expertise on the PR

What pandas version are you using @dugarte-vox ?

Sorry for the late response. I’m currently using pandas 2.2.1. I’ve tried with a lower version of pandas (2.0.3) as you suggested and the example is working.

Likely be a minor release

Ok to do so! I trust your judgment on whether to do a patch or minor 😄.