scikit-learn: [RFC] Missing values in RandomForest
So far, we have been using our `preprocessing.Imputer` to take care of missing values before fitting the RandomForest. Ref: this example.
Proposal: It would be beneficial to have missing values taken care of natively by all the tree-based classifiers/regressors…

Have a `missing_values` parameter which will be either `None` (to raise an error when we run into missing values) or an int/NaN value that will act as the placeholder for missing values.
Different ways in which missing values can be handled (I might have naively added duplicates):

1. (-1 -1) Add an optional `imputation` variable, where we can either:
   - Specify the strategy `'mean'`, `'median'`, `'most_frequent'` (or `missing_value`?) and let the clf construct the `Imputer` on the fly…
   - Pass a built `Imputer` object (like we do for `scoring` or `cv`).

   This was the simplest approach. Variants are 5 and 6. Note that we can do this already using a pipeline of imputer followed by the random forest (a minimal pipeline sketch is given after this list).
2. (-1 +1) Ignore the missing values at the time of generating the splits.
3. (+1 +1) Find the best split by sending the missing-valued samples to either side and choosing the direction that brings about the maximum reduction in entropy (impurity). This is Gilles' suggestion. It is conceptually the same as the "separate-class method", where the missing values are considered as a separate categorical value. Ding and Simonoff's paper considers this to be the best method in a variety of situations. (An illustrative sketch of the direction-choosing step also follows the list.)
4. As done in `rpart`, we could use surrogate variables, where the strategy is basically to use the other features to decide the split if one feature goes missing…
5. Probabilistic method where the missing values are sent to both children, but are weighted in proportion to the number of non-missing values in each split. Ref: Ding and Simonoff's paper. I think this goes something along the lines of this example, with `X = [1, 2, 3, nan, nan, nan, nan]` and `y = [0, 0, 1, 1, 1, 1, 0]`:
   - Split with the available values: L --> [1, 2], R --> [3]
   - Weights for the last 4 missing-valued samples: L --> 2/3, R --> 1/3
6. Do imputation, treating it as a supervised learning problem in itself, as done in MissForest: build a model using the available data --> predict the missing values using this built model.
7. Impute the missing values using an inaccurate estimate (say, the median imputation strategy). Build a RF on the completed data and update the missing values of each sample by the weighted mean value using proximity-based methods. Repeat this until convergence. (Refer to Gilles' PhD, Section 4.4.4.)
8. Similar to 6, but a one-step method where the imputation is done using the median of the k nearest neighbors. Refer to this airbnb blog.
9. Use ternary trees instead of binary trees, with one branch dedicated to missing values? (Refer to Gilles' PhD, Section 4.4.4.) This, I think, is conceptually similar to 4.
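As mentioned in option 1, the pipeline workaround is already available. A minimal sketch, using `SimpleImputer` (the current replacement for `preprocessing.Imputer`) and a made-up toy dataset:

```python
# Option 1 workaround today: impute, then fit the forest, via a pipeline.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
y = np.array([0, 0, 1, 1])

clf = make_pipeline(
    SimpleImputer(strategy="median"),          # 'mean' / 'most_frequent' also possible
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(X, y)
print(clf.predict([[np.nan, 5.0]]))
```

And a rough illustrative sketch of the direction-choosing step from option 3. This is not scikit-learn's internal implementation; the Gini helper, threshold and toy data below are just to make the idea concrete:

```python
# For a fixed threshold, try routing the missing-valued samples to the left
# child and to the right child, and keep whichever direction gives the lower
# size-weighted Gini impurity.
import numpy as np

def gini(y):
    """Gini impurity of a label vector (an empty slice counts as pure)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_child_impurity(y_left, y_right):
    """Size-weighted average impurity of the two children."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n

def best_direction_for_missing(x, y, threshold):
    """Decide whether samples whose x is missing should go left or right."""
    missing = np.isnan(x)
    goes_left = np.zeros(len(x), dtype=bool)
    goes_left[~missing] = x[~missing] <= threshold
    goes_right = ~missing & ~goes_left

    impurity_if_left = weighted_child_impurity(y[goes_left | missing], y[goes_right])
    impurity_if_right = weighted_child_impurity(y[goes_left], y[goes_right | missing])
    return "left" if impurity_if_left <= impurity_if_right else "right"

# Toy data reusing the example from option 5 above.
x = np.array([1.0, 2.0, 3.0, np.nan, np.nan, np.nan, np.nan])
y = np.array([0, 0, 1, 1, 1, 1, 0])
print(best_direction_for_missing(x, y, threshold=2.5))  # -> "right" here
```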
NOTE:

- 4, 7, 8 and 9 are computationally intensive.
- 5 is not easy to do with our current API.
- 3 and 6 seem promising. I will implement 3 and see if I can extend that to 6 later.
- Gilles' -1s were for 1 and 2 (the rest were added later).
- Ding and Simonoff's paper, which compares various methods and their relative accuracy, is a good reference.
(Table taken from Ding and Simonoff's paper: the performance of various missing-value methods.)
About this issue
- Original URL
- State: closed
- Created 9 years ago
- Reactions: 16
- Comments: 51 (37 by maintainers)
Agree with above. All the boosting libraries have the ability to handle nulls natively. This should be a feature in sklearn for 2021. There should be at least the option to handle them.
I opened a PR https://github.com/scikit-learn/scikit-learn/pull/23595 as the first step to getting missing-value support in random forests. The PR adds missing-value support for decision trees, restricted to `splitter="best"` and dense data to keep the PR smaller and easier to review. Once that PR is merged, follow-up PRs will add missing-value support for the random splitter and sparse data, and ultimately enable it for random forests.

Closing this issue since the feature has been implemented.
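For reference, a minimal sketch of what that first step enables, assuming a scikit-learn release that already includes the PR above: a plain decision tree fit directly on data containing NaNs (toy data made up for illustration).

```python
# Sketch assuming a release with missing-value support for DecisionTreeClassifier
# (splitter="best", dense input). NaNs are accepted directly by fit/predict.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [np.nan], [2.0], [np.nan]])
y = np.array([0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict(np.array([[np.nan]])))
```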
Currently our boosting estimators, `HistGradientBoostingRegressor` and `HistGradientBoostingClassifier`, can handle missing values natively. This is the approach used by XGBoost.
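For example (a minimal sketch with made-up toy data), these estimators accept NaNs directly, with no imputation step:

```python
# HistGradientBoostingClassifier handles NaN natively: during training, samples
# with missing values are routed to whichever child yields the better gain.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[1.0], [2.0], [np.nan], [3.0], [np.nan], [4.0]])
y = np.array([0, 0, 1, 1, 1, 1])

clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(clf.predict(np.array([[np.nan]])))
```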
In terms of theoretical analysis, you can find one in Section 5 of https://arxiv.org/abs/1902.06931, and Section 6 empirically compares multiple split strategies.
Another, more extensive, empirical study of missing values shows the benefit of handling them in trees: https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac013/6568998
Does anyone know what the status of this is?
Agree with folks above, this is a critical feature!
I think a boolean mask would have to be the way to go. Testing at the Python level seems simpler than trying to test at the Cython level. I'm concerned this will bog down the tree implementation, though.
@lesshaste Thanks a lot for the links… I’ll look into it 😃