pandas: PERF: pd.BooleanDtype in row operations is about 2000 times slower

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

shape = 250_000, 100
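# NOTE: randint's upper bound is exclusive, so randint(0, 1) yields only zeros (every mask value is False)
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))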


np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0) # 16.3 ms
%timeit np_mask.any(axis=0) # 5.86 ms
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1) # 14.1 s 
%timeit np_mask.any(axis=1) # 6.73 ms
16.3 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.86 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.1 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.73 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
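
The slowdown factors implied by the timings above can be worked out with plain arithmetic on the reported means. The sketch below does only that; the dictionary layout and the `slowdown` helper are illustrative names, not part of pandas.

```python
# Mean timings copied from the %timeit output above, converted to seconds.
timings = {
    ("BooleanDtype", "axis=0"): 16.3e-3,
    ("numpy bool", "axis=0"): 5.86e-3,
    ("BooleanDtype", "axis=1"): 14.1,
    ("numpy bool", "axis=1"): 6.73e-3,
}


def slowdown(axis: str) -> float:
    """Return how many times slower BooleanDtype is than numpy bool (hypothetical helper)."""
    return timings[("BooleanDtype", axis)] / timings[("numpy bool", axis)]


col_ratio = slowdown("axis=0")
row_ratio = slowdown("axis=1")
print(f"axis=0: {col_ratio:.1f}x slower")  # axis=0: 2.8x slower
print(f"axis=1: {row_ratio:.0f}x slower")  # axis=1: 2095x slower
```

So the column-wise reduction is under 3x slower, while the row-wise reduction is roughly 2000x slower.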

Edit: additional context and unexpected behavior

import pandas as pd
import numpy as np
import pyarrow as pa

# WIND_SPEED and WIND_GUST are random integers in [0, 100); the QC codes are in [0, 3), i.e. 0-2
df = pd.DataFrame(
    {
        "WIND_SPEED": np.random.randint(0, 100, size=(100,)),
        "WIND_SPEED_QC": np.random.randint(0, 3, size=(100,)),
        "WIND_GUST": np.random.randint(0, 100, size=(100,)),
        "WIND_GUST_QC": np.random.randint(0, 3, size=(100,)),
    }
# I've been looking into the pyarrow dtypes and encountered an unexpected behavior
).astype(pd.ArrowDtype(pa.uint8()))
# the >= comparison returns pd.BooleanDtype columns rather than pd.ArrowDtype
mask = df[["WIND_SPEED_QC", "WIND_GUST_QC"]] >= 1
# this is not expected
assert all(isinstance(x, pd.BooleanDtype) for x in mask.dtypes)
# pd.BooleanDtype
%timeit mask.any(axis=1) # 5.35 ms
# pd.ArrowDtype
%timeit mask.astype(pd.ArrowDtype(pa.bool_())).any(axis=1) # 7.24 ms 
# np.bool_
%timeit mask.astype(bool).any(axis=1) # 197 µs

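It makes sense that pd.BooleanDtype is faster than pd.ArrowDtype here, but both are far slower than np.bool_ (5.35 ms and 7.24 ms versus 197 µs). If the comparison is going to change the backend anyway, it would make more sense for it to return the np.bool_ dtype.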

5.35 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.24 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
197 µs ± 7.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
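
Given these timings, one possible user-side workaround is to cast the nullable mask to NumPy `bool` before reducing along rows. This is only a sketch, not a fix inside pandas: `rowwise_any` is a hypothetical helper, and it assumes that treating missing values as `False` is acceptable for the data at hand.

```python
import numpy as np
import pandas as pd


def rowwise_any(mask: pd.DataFrame) -> pd.Series:
    """Row-wise any() computed on a plain NumPy bool copy (hypothetical helper)."""
    # Missing values become False. This matches the default skipna=True
    # behaviour of any(): a skipped NA can never turn a row True.
    as_numpy_bool = mask.fillna(False).astype(bool)
    return as_numpy_bool.any(axis=1)


# Small nullable-boolean frame with one missing value, for illustration only.
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(0, 2, size=(1_000, 10))).astype(pd.BooleanDtype())
small.iloc[0, 0] = pd.NA

fast = rowwise_any(small)
reference = small.any(axis=1)  # the direct (potentially slow) path

# Both paths should agree row by row.
print((fast.to_numpy() == reference.to_numpy(dtype=bool)).all())
```

The cast makes a copy of the data, so this trades memory for speed. If `skipna=False` semantics are needed, the missing values must be handled differently.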

Installed Versions

INSTALLED VERSIONS

commit : 1a2e300170efc08cb509a0b4ff6248f8d55ae777
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.90.1-microsoft-standard-WSL2
Version : #1 SMP Fri Jan 27 02:56:13 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0rc0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

Prior Performance

No response

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 20 (11 by maintainers)

Most upvoted comments

I am not seeing this on 2.1.4. Can you post your timings from the code in the OP?

Sorry for the inconvenience. However, I’ve noticed that the issue still persists in version 2.1.4. I had thought that the problem had been resolved by #54341. Could you tell me why it still exists? Thanks.

Got the same result (essentially) on pandas 1.5.3.

20.3 ms ± 920 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.9 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.01 s ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
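
For anyone reproducing these numbers outside IPython, the benchmark can be written as a plain script with the standard-library `timeit` module. The version below is a sketch under stated assumptions: the frame is much smaller than the 250_000 x 100 one in the report so that the slow path finishes quickly, `randint(0, 2)` is used so the mask contains both values, and `best_ms` is a hypothetical helper.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a reduced shape so the slow row-wise path stays fast to run.
shape = (2_000, 20)
# randint's upper bound is exclusive, so (0, 2) yields both 0 and 1.
mask = pd.DataFrame(np.random.randint(0, 2, size=shape))
np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())


def best_ms(func, repeat=3, number=1):
    """Best wall-clock time of func() in milliseconds (hypothetical helper)."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1e3


results = {
    "pd_mask.any(axis=0)": best_ms(lambda: pd_mask.any(axis=0)),
    "np_mask.any(axis=0)": best_ms(lambda: np_mask.any(axis=0)),
    "pd_mask.any(axis=1)": best_ms(lambda: pd_mask.any(axis=1)),
    "np_mask.any(axis=1)": best_ms(lambda: np_mask.any(axis=1)),
}

print("pandas", pd.__version__)
for label, ms in results.items():
    print(f"{label}: {ms:.3f} ms")
```

Absolute numbers will differ from the `%timeit` output because of the smaller frame; the ratio between the two `axis=1` lines is the figure worth comparing across pandas versions.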

Installed versions

commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.11.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 126 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.5.3
numpy : 1.25.0.dev0+896.g071388f95
pytz : 2021.3
dateutil : 2.8.2
setuptools : 59.2.0
pip : 23.0.1
Cython : 0.29.33
pytest : 6.2.5
hypothesis : 6.24.1
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.9.0
pandas_datareader : None
bs4 : 4.11.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : 2.0.2
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2