scikit-learn: Trees with MAE criterion are slow to train
Description
When I use the 'mae' criterion with ExtraTreesRegressor, training runs for a very long time and appears never to finish. There is no problem with 'mse'. I have found that I am not the only one with this problem. I have tried two versions (0.18 and 0.19.X), but it made no difference.
https://www.kaggle.com/c/allstate-claims-severity/discussion/24293
Steps/Code to Reproduce
from sklearn.ensemble import ExtraTreesRegressor

rfr = ExtraTreesRegressor(n_estimators=100,
                          max_features=0.8,
                          criterion='mae',
                          max_depth=6,
                          min_samples_leaf=200,
                          n_jobs=-1,
                          random_state=17,
                          verbose=0)
mod = rfr.fit(train[distilled_features], train['y'])
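train and distilled_features are not included above. A self-contained variant of the reproduction, assuming synthetic data from make_regression and a smaller forest so the run eventually finishes, could look like this:

import time

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the Kaggle data; sizes reduced so the 'mae' run finishes.
X, y = make_regression(n_samples=10000, n_features=50, random_state=17)

# 'mse'/'mae' as in 0.18/0.19; renamed 'squared_error'/'absolute_error' in 1.0+.
for criterion in ('mse', 'mae'):
    est = ExtraTreesRegressor(n_estimators=10,
                              max_features=0.8,
                              criterion=criterion,
                              max_depth=6,
                              min_samples_leaf=200,
                              n_jobs=-1,
                              random_state=17)
    tic = time.perf_counter()
    est.fit(X, y)
    print(criterion, round(time.perf_counter() - tic, 1), 'seconds')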
Expected Results
Training should complete normally when criterion='mae' is used.
Actual Results
Training never finishes; the call to fit hangs indefinitely.
Versions
Darwin-16.1.0-x86_64-i386-64bit Python 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] NumPy 1.12.1 SciPy 0.19.0 Scikit-Learn 0.19.X
About this issue
- Original URL
- State: open
- Created 7 years ago
- Reactions: 26
- Comments: 61 (40 by maintainers)
I noticed this is currently still an issue: my training with "mae" as criterion does not finish (grid search with gradient-boosted regression trees). I spent a lot of time trying to debug what was wrong before stumbling on this thread. This is why I would propose adding a warning in the documentation (e.g. "training with 'mae' as criterion means recalculating the loss function after each iteration and can take an extremely long time") to prevent future users from getting stuck on the same problem.
Observation
When using criterion='absolute_error', most of the time is spent pushing, popping, or removing elements from the WeightedMedianCalculator's WeightedPQueue: https://github.com/scikit-learn/scikit-learn/blob/2f65ac764be40e2420817cbeca3301c8d664baa3/sklearn/tree/_utils.pyx#L78-L95

This slows down the execution because the data structures are resized by the utility functions for memory management: https://github.com/scikit-learn/scikit-learn/blob/2f65ac764be40e2420817cbeca3301c8d664baa3/sklearn/tree/_utils.pyx#L19-L73

Currently copies are made and memory is moved (see __memmove_evex_unaligned_erms in the elements of the last section below). Moreover, the reallocation generally blocks when using default allocators, which is currently the case in scikit-learn: in a multi-threaded context (e.g. when using RandomForest* or ExtraTrees*), this can create significant contention at the level of the OS, bringing performance down.

Proposed solutions (including proposals for generally improved maintenance):
- WeightedPQueue and custom functions for memory management
- mimalloc to alleviate thread contention on reallocations (this would also help reallocations of std::vector)

Elements for troubleshooting performance
Script
Inspection with py-spy

A small follow-up: inspecting with py-spy with this command (taskset(1) is used to limit the execution to one thread) gives the following SpeedScope-inspectable report:
sklearn_9626
Inspection with perf(1)

When using a hierarchical report, perf(1) gives:
https://github.com/yupbank/np_decision_tree/blob/master/decision_tree/strategy_l1.py#L87
I did something in the past to abstract the core data structure into a running median to make faster MAE splits.
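As an illustration of the running-median idea, here is a minimal unweighted sketch using two heaps; this is not the linked implementation and ignores sample weights and removals, which the tree criterion also needs, but it shows how inserting a sample can be O(log n) while reading the current median stays O(1):

import heapq

class RunningMedian:
    """Unweighted running median via two heaps: a max-heap of the lower
    half (stored negated) and a min-heap of the upper half."""

    def __init__(self):
        self._lower = []  # max-heap (negated values)
        self._upper = []  # min-heap

    def push(self, x):
        if self._lower and x > -self._lower[0]:
            heapq.heappush(self._upper, x)
        else:
            heapq.heappush(self._lower, -x)
        # Rebalance so the halves differ in size by at most one element.
        if len(self._lower) > len(self._upper) + 1:
            heapq.heappush(self._upper, -heapq.heappop(self._lower))
        elif len(self._upper) > len(self._lower) + 1:
            heapq.heappush(self._lower, -heapq.heappop(self._upper))

    def median(self):
        if len(self._lower) > len(self._upper):
            return -self._lower[0]
        if len(self._upper) > len(self._lower):
            return self._upper[0]
        return (-self._lower[0] + self._upper[0]) / 2.0

rm = RunningMedian()
for v in [3.0, 1.0, 4.0, 1.0, 5.0]:
    rm.push(v)
print(rm.median())  # 3.0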
+1 for one or more PRs to implement the proposed solutions and compare them with an ad-hoc benchmark.
Note: for py-spy output, if you use -f speedscope, please use a .json filename, otherwise it's quite confusing (even for my Firefox).

That commit was written by past Jeremy, who was much better informed on the issue. It's probably the best I can give you without diving into it again, which I haven't had the time for. Reflecting back, I can also say that I didn't succeed because I wasn't able to set up a good CI cycle for the WIP code (I wasn't as familiar with Cython and testing as I am now), and because I got segmentation faults. I also suspect my code logic around the update was flawed; this could be avoided with some good test cases when implementing the update.
I would suggest starting over from scratch instead of from my very outdated branch, and using the commit as a reference for where you might need to add code more so than for what code you need to add (the descriptions in this issue are better for that).
PSA: users are encouraged to use HistGradientBoostingRegressor with loss='least_absolute_deviation', which will be considerably faster.

@Broundal it's because when the criterion is MAE (and, as a result, the random forest outputs the median instead of the mean), it is more difficult to update the loss function. For MSE you can take the original variance and perform a constant-time update of the loss when you move a sample from one side of the partition to the other, but for MAE the loss is currently recomputed from scratch, which takes O(N) time. As a result, the entire process is O(N^2) instead of O(N). The only time they'll perform comparably is if you have very few samples. At 10k samples your computation at least finishes in about 20 seconds; at 800k samples the N^2 scaling would imply you need 128,000 seconds, or about 1.5 days. Conversely, 1,000 samples should only take about 0.2 seconds.
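To make that asymmetry concrete, here is a toy sketch in plain Python/NumPy (not the actual Cython criterion code; the function names are illustrative) contrasting the constant-time squared-error update with the full absolute-error recomputation:

import numpy as np

def mse_impurity_sweep(y):
    """Sweep split points left to right, updating sum and sum of squares
    in O(1) per moved sample; each side's variance follows from those."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    left_sum = left_sq = 0.0
    right_sum, right_sq = y.sum(), float(np.dot(y, y))
    impurities = []
    for i in range(1, n):                      # split after the i-th sample
        left_sum += y[i - 1]; left_sq += y[i - 1] ** 2
        right_sum -= y[i - 1]; right_sq -= y[i - 1] ** 2
        left_var = left_sq / i - (left_sum / i) ** 2
        right_var = right_sq / (n - i) - (right_sum / (n - i)) ** 2
        impurities.append((left_var, right_var))
    return impurities

def mae_impurity_sweep(y):
    """Same sweep for absolute error: without an incremental median
    structure, each candidate split recomputes medians and deviations,
    so every step costs O(N) and the whole sweep is O(N^2)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    impurities = []
    for i in range(1, n):
        left, right = y[:i], y[i:]
        impurities.append((np.mean(np.abs(left - np.median(left))),
                           np.mean(np.abs(right - np.median(right)))))
    return impurities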
We think there's a faster way to update the loss function.

What was your idea? I guess, theoretically, it boils down to computing the exact median, which costs O(n log n), right? How about an approximation of the median in linear time, like the Median of Medians? Especially when it comes to random forests, the error introduced by the median approximation can be neglected, or rather seen as another source of randomness.
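For reference, a small sketch of one reading of that suggestion: the median-of-medians pivot can be computed in linear time and is guaranteed to fall between the 30th and 70th percentiles. This is illustrative only, not a proposed scikit-learn API:

def approx_median(values):
    """Median-of-medians pivot: group into fives, take each group's exact
    median, then recurse on those medians. Runs in O(n) overall and returns
    a value guaranteed to lie between the 30th and 70th percentiles."""
    values = list(values)
    if len(values) <= 5:
        return sorted(values)[len(values) // 2]
    group_medians = [
        sorted(values[i:i + 5])[len(values[i:i + 5]) // 2]
        for i in range(0, len(values), 5)
    ]
    return approx_median(group_medians)

print(approx_median(range(100)))  # 62 here, while the exact median is 49.5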
Is there any update on this? Running a multi-output Random Forest Regressor with the mae criterion and a couple of million samples takes days, compared to minutes for mse.
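For readers landing here: following the PSA earlier in the thread, a minimal sketch of the histogram-based workaround (the synthetic data is an assumption; the loss is spelled 'absolute_error' in scikit-learn >= 1.0 and 'least_absolute_deviation' before that, and versions older than 0.24 additionally require importing sklearn.experimental.enable_hist_gradient_boosting first):

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=100000, n_features=50, random_state=17)

# Histogram-based boosting with an absolute-error loss avoids the
# O(N^2) exact-MAE splitting discussed above.
est = HistGradientBoostingRegressor(loss="absolute_error", random_state=17)
est.fit(X, y)
print(est.score(X, y))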