pandas: PERF: groupby slower than pure numpy (for many aggregation functions)
@ml31415 and I have just created/updated an aggregation package which has multiple equivalent implementations: pure python, numpy, pandas, and scipy.weave. As shown on the readme, pandas is slower than a careful numpy implementation for most aggregation functions, and slower than scipy.weave by a fairly wide margin in all cases. The worst functions appear to be `any` and `all`. Note that the times shown include construction of the dataframe, although whatever overhead this incurs cannot be more than the minimum time shown (11ms for `groupby.first()`).
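To make the comparison concrete, here is a minimal sketch of the kind of equivalence being discussed. It is not the code from the package; the array names (`group_idx`, `values`) and sizes are assumptions chosen for illustration. It only checks that both approaches give the same answer and makes no claim about timings.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed, not from the benchmark): one integer group
# label per row, plus the values to aggregate.
rng = np.random.default_rng(0)
n_groups = 1000
group_idx = rng.integers(0, n_groups, size=100_000)
values = rng.random(100_000)

# pandas: group by the label array and sum each group.
pandas_sum = pd.Series(values).groupby(group_idx).sum()

# numpy: bincount with weights sums the values that share a label.
numpy_sum = np.bincount(group_idx, weights=values, minlength=n_groups)

# "any" per group, the case reported as worst for pandas.
flags = values > 0.999
pandas_any = pd.Series(flags).groupby(group_idx).any()

# numpy: mark every group that contains at least one True flag.
numpy_any = np.zeros(n_groups, dtype=bool)
numpy_any[group_idx[flags]] = True

# Compare only on the groups pandas actually observed.
sums_match = np.allclose(pandas_sum.to_numpy(), numpy_sum[pandas_sum.index])
anys_match = np.array_equal(pandas_any.to_numpy(), numpy_any[pandas_any.index])
```

The numpy versions rely on the labels already being small non-negative integers; pandas has to derive such codes from arbitrary keys first, which is part of what it is doing with the extra time.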
Surely pandas should be able to more or less match at least the numpy implementation, and maybe even the weave version (since it looks like pandas is also getting at low level C using cython)?
Note that although pandas may end up faster when performing multiple aggregations of the same grouping (e.g. first, last and max), the other implementations also have this potential (admittedly this optimization is yet to be implemented though).
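As a rough illustration of sharing one grouping across several aggregations, the sketch below asks pandas for first, last and max from a single groupby object, and then does the analogous thing in numpy by sorting once and reusing the group boundaries. The column names and data are made up, and the numpy half is only one possible way to do this, not the approach used in the package.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": [2, 0, 2, 1, 0, 1],
    "val": [5.0, 1.0, 3.0, 4.0, 2.0, 6.0],
})

# pandas: one groupby object, several aggregations requested together.
result = df.groupby("key")["val"].agg(["first", "last", "max"])

# numpy: sort by key once (stable, so original order is kept within a group),
# then reuse the same group boundaries for every aggregation.
keys = df["key"].to_numpy()
vals = df["val"].to_numpy()
order = np.argsort(keys, kind="stable")
sorted_keys = keys[order]
sorted_vals = vals[order]

# Index where each group starts, and one past where it ends.
starts = np.flatnonzero(np.r_[True, sorted_keys[1:] != sorted_keys[:-1]])
ends = np.r_[starts[1:], len(sorted_keys)]

np_first = sorted_vals[starts]
np_last = sorted_vals[ends - 1]
np_max = np.maximum.reduceat(sorted_vals, starts)
```

The sort and the boundary computation are the shared cost here; each additional aggregation is then a cheap indexing or `reduceat` call.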
About this issue
- State: closed
- Created 9 years ago
- Reactions: 2
- Comments: 30 (14 by maintainers)
I’m still confused. I get the benefit of pandas as I use it often in roughly the way it is intended. But the particular point we are discussing here is groupby using levels, where all the index values and labels already exist. So all I’m saying is that there’s no need to recompute them, if you care about perf.
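The point about levels can be sketched as follows. A MultiIndex already stores an integer code per row for each level (exposed as `.codes` in recent pandas; older versions called this `.labels`), so in principle those codes can be passed straight to a numpy aggregation. The example data is invented, and it assumes no missing values in the index (missing entries get code -1, which `np.bincount` would reject).

```python
import numpy as np
import pandas as pd

# Illustrative two-level index; every level value is actually used.
idx = pd.MultiIndex.from_product([["a", "b", "c"], [0, 1]],
                                 names=["letter", "num"])
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# pandas: group by an existing index level.
pandas_sum = s.groupby(level="letter").sum()

# numpy: reuse the integer codes the MultiIndex already holds for level 0.
codes = idx.codes[0]      # one integer per row: 0, 0, 1, 1, 2, 2
levels = idx.levels[0]    # the unique level values: a, b, c
numpy_sum = np.bincount(codes, weights=s.to_numpy(), minlength=len(levels))
```

Here `numpy_sum[i]` corresponds to `levels[i]`, so the result lines up with the pandas output without any further factorization step.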
@d1manson you are still missing the point. You are not comparing apples to apples, so how does it matter if something is faster than something that IS NOT COMPARABLE???
Yes, pandas is good at all those things, and that’s its main job, but I kind of hoped that it would also be getting close to optimal performance, which appears not to be the case here.
Maybe perf is not really that important to most people for the groupby machinery, but I thought it was worth pointing out that there is room for improvement - even if only for some subset of cases.