pandas: BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category'
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# EXAMPLE 1: both columns keep the default 'object' dtype
import numpy as np
import pandas as pd
import time

# Build a 13M-row DataFrame: col1 drawn from ~700k distinct strings, col2 from 860 distinct strings.
t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)
sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time() - t0)

# Time the groupby/value_counts/unstack operation.
t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time() - t0)
# EXAMPLE 2: same data, but col2 is cast to 'category' dtype
import numpy as np
import pandas as pd
import time

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)
sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')  # the only difference from EXAMPLE 1
print(time.time() - t0)

# Same operation as EXAMPLE 1.
t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time() - t0)
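For quicker experimentation, a scaled-down sketch of the same comparison follows. The sizes and key/value names here are arbitrary assumptions, absolute timings will differ, and the categorical slowdown may be less pronounced at this scale.

# Scaled-down sketch (illustrative sizes, not the original benchmark).
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
col1_small = rng.choice([f"key{i}" for i in range(50_000)], size=1_000_000)
col2_small = rng.choice([f"val{i}" for i in range(200)], size=1_000_000)
df_small = pd.DataFrame({"col1": col1_small, "col2": col2_small})

for dtype in ["object", "category"]:
    tmp = df_small.copy()
    tmp["col2"] = tmp["col2"].astype(dtype)
    t0 = time.time()
    tmp.groupby("col1")["col2"].value_counts().unstack()
    print(dtype, time.time() - t0)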
Issue Description
I am using a 2022 MacBook Pro (M1 Pro) with pandas 1.4.1 and numpy 1.22.2.
I noticed a significant performance drop when running
sample_df.groupby("col1")["col2"].value_counts().unstack()
when col2, or both col1 and col2, are of type 'category' instead of the default type 'object'. The operation in EXAMPLE 1 runs in around 25 seconds on my machine (similar on a 2019 MacBook Pro with an Intel processor). In EXAMPLE 2 I let the operation run for more than 20 minutes and it still had not finished (on the 2019 Intel MacBook Pro the running time is similar for EXAMPLE 1 and EXAMPLE 2).
Moreover, on the M1 the peak memory usage for EXAMPLE 1 is around 8-10 GB, while for EXAMPLE 2 it well exceeds 30 GB.
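One way to get a comparable peak-memory reading from within the script itself is the standard-library resource module. This is only a sketch added for illustration, not how the figures above were obtained.

# Illustrative sketch: report peak resident set size after running the groupby.
import resource
import sys

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# ru_maxrss is reported in bytes on macOS but in kilobytes on Linux.
scale = 1 if sys.platform == "darwin" else 1024
print(f"peak RSS: {peak * scale / 1e9:.1f} GB")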
Expected Behavior
On the M1:
EXAMPLE 1 and EXAMPLE 2 should perform roughly the same as they do on the Intel processor.
Installed Versions
INSTALLED VERSIONS
commit : 06d230151e6f18fdb8139d09abf539867a8cd481
python : 3.9.10.final.0
python-bits : 64
OS : Darwin
OS-release : 21.3.0
Version : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.9.3
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 17 (14 by maintainers)
Yes, what I meant is that when both this issue and https://github.com/pandas-dev/pandas/issues/46357 are fixed, it won't be possible to retrieve only observed values in order to get the best performance with the categorical dtype; a prior cast to object dtype will be needed.

@rhshadrach It does not seem it has been reported yet. I created the issue and also detailed why I have a doubt whether the observed parameter should have an effect here.

@whypandasslow - thanks for the response. Using DataFrameGroupBy is merely a workaround. The performance issues in SeriesGroupBy should be fixed.
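For reference, a minimal sketch of the DataFrameGroupBy-style workaround alluded to above, assuming it means counting (col1, col2) pairs with DataFrame.groupby and observed=True rather than SeriesGroupBy.value_counts. The result is close but not identical (missing combinations become 0 rather than NaN), and the speedup is not guaranteed on every pandas version.

# Workaround sketch: count (col1, col2) pairs via DataFrame.groupby instead of
# SeriesGroupBy.value_counts; observed=True skips unused category combinations.
processed_df = (
    sample_df.groupby(["col1", "col2"], observed=True)
    .size()
    .unstack(fill_value=0)
)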