pandas: PERF: `apply` on `boolean`-dtype DataFrame is exceedingly slow

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

from timeit import timeit
import pandas as pd

# Create a dummy DataFrame of bools and null values.
df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype="object")

# Casting the whole frame to boolean dtype and then applying row-wise is very slow.
print(timeit(lambda: df.astype("boolean").apply(lambda row: row.count(), axis=1), number=10))  # 112s

# Part of that time is the `astype` on the entire DataFrame, but most of it is the
# subsequent `apply`, which is particularly slow for a boolean-dtype DataFrame.
print(timeit(lambda: df.astype("boolean"), number=10))  # 3.98s

# This equivalent statement, which casts each row inside `apply`, runs fast, so there
# seems to be no good reason for the `apply` in the first statement to be so slow.
print(timeit(lambda: df.apply(lambda row: row.astype("boolean").count(), axis=1), number=10))  # 0.95s

The fast workaround is fine in some cases, but in other situations it is inconvenient to cast back and forth between object dtype and boolean dtype just to make `apply` efficient.
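As a quick sanity check that the two formulations really are equivalent, the following sketch compares their results. It uses a smaller frame than the one in the report (20 x 60 instead of 200 x 600, an assumption made only so that it runs quickly), and the variable names `slow` and `fast` are illustrative.

```python
import pandas as pd

# Smaller frame than in the report, so the slow path finishes quickly.
df = pd.DataFrame([[True, False, pd.NA] * 20] * 20, dtype="object")

# Slow path: cast the whole frame first, then apply row-wise.
slow = df.astype("boolean").apply(lambda row: row.count(), axis=1)

# Fast path: keep object dtype and cast each row inside `apply`.
fast = df.apply(lambda row: row.astype("boolean").count(), axis=1)

# Both count the non-null entries per row (40 of the 60 values in each row).
print((slow == fast).all())
```

Only the order of the cast and the `apply` differs between the two paths, so any gap in timing comes from how each row is produced rather than from the work done on it.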

Installed Versions

commit : aced6eedf90f3fdb0e658f33ac89c13fad62b06e
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.4.0.dev0+970.gaced6eedf9
numpy : 1.20.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.2.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.4
sphinx : 4.2.0
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: 0.10.0
bs4 : 4.10.0
bottleneck : None
fsspec : 2021.10.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.10.1
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.54.1

Prior Performance

Not applicable.

About this issue

State: closed
Created 3 years ago
Comments: 20 (20 by maintainers)

Commits related to this issue

PERF: improve efficiency of `BaseMaskedArray.__setitem__` This somewhat deals with #44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there). — committed to alexreg/pandas by alexreg 3 years ago

Most upvoted comments

A large part of the slowdown actually comes from a small refactor commit from @jbrockmendel: https://github.com/pandas-dev/pandas/pull/43203. Reverting it speeds up the example df2.apply(lambda row: row.count(), axis = 1) almost 10x; after that it is still slower than the block version, as expected.

jorisvandenbossche on Oct 25, 2021

This is going to come as a shock to @jreback: it looks like the lack of 2D EA support is to blame.

df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype = "object")
df2 = df.astype("boolean")

df2.apply(lambda row: row.count(), axis = 1)

The .apply with axis=1 iterates over rows, and df2.iloc[i] is very slow because it isn’t just slicing an existing array.

%prun -s cumtime df2.apply(lambda row: row.count(), axis = 1)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.026    6.026 {built-in method builtins.exec}
        1    0.000    0.000    6.026    6.026 <string>:1(<module>)
        1    0.000    0.000    6.026    6.026 frame.py:8719(apply)
        1    0.000    0.000    6.026    6.026 apply.py:695(apply)
        1    0.000    0.000    6.026    6.026 apply.py:851(apply_standard)
        1    0.001    0.001    6.025    6.025 apply.py:857(apply_series_generator)
      201    0.000    0.000    6.016    0.030 apply.py:977(series_generator)
      201    0.002    0.000    5.988    0.030 frame.py:3469(_ixs)
      201    0.268    0.001    5.963    0.030 managers.py:951(fast_xs)
   120600    0.206    0.000    4.348    0.000 masked.py:211(__setitem__)
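The per-row cost described above can be observed in isolation by timing a single row lookup on each frame. This is a minimal sketch, not a benchmark: the frame is smaller than the one in the report, the names `t_obj` and `t_bool` are illustrative, and the actual numbers depend heavily on the pandas version, so none are claimed here.

```python
from timeit import timeit

import pandas as pd

# Smaller frame than in the report, to keep the timing loop short.
df = pd.DataFrame([[True, False, pd.NA] * 20] * 20, dtype="object")
df2 = df.astype("boolean")

# Object dtype: the frame is backed by one 2D block, so a row can be sliced out.
t_obj = timeit(lambda: df.iloc[0], number=100)

# Boolean dtype: each column is a separate 1D masked array, so a row has to be
# assembled from one value per column.
t_bool = timeit(lambda: df2.iloc[0], number=100)

print(f"object row lookup:  {t_obj:.4f}s")
print(f"boolean row lookup: {t_bool:.4f}s")
```

Since `apply` with `axis=1` performs one such lookup per row, any gap between the two timings is multiplied by the number of rows.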

jbrockmendel on Oct 25, 2021

@alexreg there certainly could be some unwanted coercion going on in apply. If you can investigate, that would be great.

jreback on Oct 25, 2021

You claim the astype is slow, so prove it.

jreback on Oct 25, 2021