pandas: PERF: `apply` on `boolean`-dtype DataFrame is exceedingly slow

  • I have checked that this issue has not already been reported.
  • I have confirmed this issue exists on the latest version of pandas.
  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

from timeit import timeit
import pandas as pd

# Create a dummy DataFrame of bools and null values.
df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype = "object")

# This runs very slowly!
print(timeit(lambda: df.astype("boolean").apply(lambda row: row.count(), axis = 1), number = 10)) # 112s

# The `astype` on the entire DataFrame accounts for part of this, but most of the time is spent in the subsequent `apply` on the boolean-dtype DataFrame.
print(timeit(lambda: df.astype("boolean"), number = 10)) # 3.98s

# This *equivalent* statement runs fast, so there seems to be no good reason for the `apply` in the first statement to be so slow.
print(timeit(lambda: df.apply(lambda row: row.astype("boolean").count(), axis = 1), number = 10)) # 0.95s

The workaround (casting to boolean inside `apply`) is fine in some cases, but in other situations it is inconvenient to cast back and forth between object dtype and boolean dtype just to make `apply` efficient.
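As a quick sanity check on the "equivalent" claim, the sketch below compares both statements on a much smaller frame. The smaller shape and the variable names are illustrative assumptions, and the `count(axis=1)` line is an alternative not mentioned in the report, shown only for comparison.

```python
import pandas as pd

# Small frame with the same True/False/NA pattern as the report (size is an assumption).
df = pd.DataFrame([[True, False, pd.NA] * 5] * 4, dtype="object")

# Slow path from the report: cast the whole frame, then apply row-wise.
cast_then_apply = df.astype("boolean").apply(lambda row: row.count(), axis=1)

# Fast path from the report: apply row-wise on object dtype, cast each row.
apply_then_cast = df.apply(lambda row: row.astype("boolean").count(), axis=1)

# Not part of the report: a vectorized count that avoids `apply` entirely.
vectorized = df.astype("boolean").count(axis=1)

print(cast_then_apply.tolist())
print(apply_then_cast.tolist())
print(vectorized.tolist())
```

Each row has 10 non-null values out of 15, so all three results should agree; only the time taken differs.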

Installed Versions

commit : aced6eedf90f3fdb0e658f33ac89c13fad62b06e
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.4.0.dev0+970.gaced6eedf9
numpy : 1.20.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.2.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.4
sphinx : 4.2.0
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: 0.10.0
bs4 : 4.10.0
bottleneck : None
fsspec : 2021.10.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.10.1
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.54.1

Prior Performance

Not applicable.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

A large part of the slowdown actually comes from a small refactor commit by @jbrockmendel: https://github.com/pandas-dev/pandas/pull/43203. Reverting it speeds up the df2.apply(lambda row: row.count(), axis = 1) example almost 10x. After that it’s still slower than the block version, as expected.

This is going to come as a shock to @jreback: it looks like the lack of 2D EA support is to blame.

df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype = "object")
df2 = df.astype("boolean")

df2.apply(lambda row: row.count(), axis = 1)

The .apply with axis=1 iterates over rows, and df2.iloc[i] is very slow because it isn’t just slicing an existing array.

%prun -s cumtime df2.apply(lambda row: row.count(), axis = 1)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.026    6.026 {built-in method builtins.exec}
        1    0.000    0.000    6.026    6.026 <string>:1(<module>)
        1    0.000    0.000    6.026    6.026 frame.py:8719(apply)
        1    0.000    0.000    6.026    6.026 apply.py:695(apply)
        1    0.000    0.000    6.026    6.026 apply.py:851(apply_standard)
        1    0.001    0.001    6.025    6.025 apply.py:857(apply_series_generator)
      201    0.000    0.000    6.016    0.030 apply.py:977(series_generator)
      201    0.002    0.000    5.988    0.030 frame.py:3469(_ixs)
      201    0.268    0.001    5.963    0.030 managers.py:951(fast_xs)
   120600    0.206    0.000    4.348    0.000 masked.py:211(__setitem__)
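The 120600 calls to `__setitem__` in the profile equal 201 row extractions times 600 columns, which is consistent with each row being assembled one element at a time. The following is a rough sketch, assuming the same frame as in the report, that reproduces this arithmetic and times a single row extraction for each dtype. The variable names are illustrative, and the timings will vary by machine and pandas version, so none are quoted here.

```python
from timeit import timeit
import pandas as pd

df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype="object")
df2 = df.astype("boolean")

n_rows, n_cols = df2.shape

# The profile shows 201 calls to `_ixs` (one more than the 200 rows),
# each touching all 600 columns.
expected_setitem_calls = (n_rows + 1) * n_cols
print(expected_setitem_calls)  # 120600

# Row extraction is the per-iteration step behind apply(axis=1).
t_object = timeit(lambda: df.iloc[0], number=20)
t_boolean = timeit(lambda: df2.iloc[0], number=20)
print(f"object-dtype row extraction:  {t_object:.4f}s")
print(f"boolean-dtype row extraction: {t_boolean:.4f}s")

# Both paths yield the same answer for a row; only the cost differs.
row = df2.iloc[0]
print(row.count())  # 400 non-null values out of 600
```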

@alexreg there could certainly be some unwanted coercion going on in apply. If you can investigate, that would be great.

You claim the astype is slow, so prove it.