pandas: PERF: `apply` on `boolean`-dtype DataFrame is exceedingly slow
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
from timeit import timeit
import pandas as pd
# Create a dummy DataFrame of bools and null values.
df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype="object")
# This runs very slowly!
print(timeit(lambda: df.astype("boolean").apply(lambda row: row.count(), axis=1), number=10))  # 112s
# As can easily be seen, this is due partly to the `astype` on the entire DataFrame,
# but mainly to the subsequent `apply` being particularly slow for a boolean-dtype DataFrame.
print(timeit(lambda: df.astype("boolean"), number=10))  # 3.98s
# This *equivalent* statement runs fast. There seems to be no good reason why the call
# to `apply` in the first statement (equivalent overall) must be so slow.
print(timeit(lambda: df.apply(lambda row: row.astype("boolean").count(), axis=1), number=10))  # 0.95s
Using the faster form is fine in some cases, but in other situations it is inconvenient to cast back and forth between object-dtype and boolean-dtype just to make `apply` efficient.
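As a quick sanity check on the claim that the two forms are equivalent, the following is a minimal sketch on a smaller frame than the one in the report (the 50 x 60 shape and the names `cast_then_apply` and `apply_then_cast` are chosen here for illustration only). It compares results only, not timings.

```python
import pandas as pd

# Smaller than the frame in the report so the check finishes quickly.
df = pd.DataFrame([[True, False, pd.NA] * 20] * 50, dtype="object")

# Slow form from the report: cast the whole frame, then apply row-wise.
cast_then_apply = df.astype("boolean").apply(lambda row: row.count(), axis=1)

# Fast form from the report: apply row-wise, casting each row inside the function.
apply_then_cast = df.apply(lambda row: row.astype("boolean").count(), axis=1)

# Both should report the same number of non-null values per row.
same_result = cast_then_apply.tolist() == apply_then_cast.tolist()
print(same_result)
```

Each row holds 20 repetitions of `[True, False, pd.NA]`, so every per-row count should be 40 in both forms.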
Installed Versions
commit : aced6eedf90f3fdb0e658f33ac89c13fad62b06e
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.4.0.dev0+970.gaced6eedf9
numpy : 1.20.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.2.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.4
sphinx : 4.2.0
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: 0.10.0
bs4 : 4.10.0
bottleneck : None
fsspec : 2021.10.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.10.1
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.54.1
Prior Performance
Not applicable.
About this issue
- State: closed
- Created 3 years ago
- Comments: 20 (20 by maintainers)
Commits related to this issue
- PERF: improve efficiency of `BaseMaskedArray.__setitem__` This somewhat deals with #44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there). — committed to alexreg/pandas by alexreg 3 years ago
A large part of the slowdown actually comes from a small refactor commit from @jbrockmendel: https://github.com/pandas-dev/pandas/pull/43203. Reverting it speeds up the example `df2.apply(lambda row: row.count(), axis = 1)` almost 10x; after that it is still slower than the block version, of course, as expected.
This is going to come as a shock to @jreback: it looks like the lack of 2D EA support is to blame. The `.apply` with `axis=1` iterates over rows, and `df2.iloc[i]` is super slow because it isn't just slicing an existing array.
@alexreg There certainly could be some unwanted coercion going on in `apply`. If you can investigate, that would be great.
you claim the astype is slow so prove it
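One way to separate the costs discussed above is to time each stage on its own. The following is only a sketch under stated assumptions: the frame is smaller than the one in the report, `df2` is assumed to be the already-cast boolean-dtype frame, and the explicit `iloc` loop is a stand-in for the row extraction described in the comments, not the actual internals of `apply`. Absolute timings will vary by machine and pandas version.

```python
from timeit import timeit

import pandas as pd

# Smaller than the frame in the report so each stage finishes quickly.
df = pd.DataFrame([[True, False, pd.NA] * 20] * 50, dtype="object")
df2 = df.astype("boolean")  # assumed meaning of `df2` in the comments

# Stage 1: the cast of the whole frame on its own.
t_astype = timeit(lambda: df.astype("boolean"), number=3)

# Stage 2: row-wise apply on the already-cast frame.
t_apply = timeit(lambda: df2.apply(lambda row: row.count(), axis=1), number=3)

# Stage 3: row extraction alone, without calling any function on the rows.
t_iloc = timeit(lambda: [df2.iloc[i] for i in range(len(df2))], number=3)

print(f"astype only:      {t_astype:.4f}s")
print(f"apply (axis=1):   {t_apply:.4f}s")
print(f"iloc row extract: {t_iloc:.4f}s")
```

If the row extraction accounts for most of the `apply` time, that would be consistent with the explanation that each row has to be assembled from separate per-column arrays rather than sliced from one 2D array.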