scikit-learn: [RFC] Missing values in RandomForest
So far, we have been using our `preprocessing.Imputer` to take care of missing values before fitting the RandomForest. Ref: this example.
Proposal: It would be beneficial to have missing values taken care of natively by all the tree-based classifiers/regressors…

Have a `missing_values` parameter which will be either `None` (to raise an error when we run into missing values) or an int/NaN value that will act as the placeholder for missing values.
Different ways in which missing values can be handled (I might have naively added duplicates):

1. (-1 -1) Add an optional `imputation` variable, where we can either:
   - Specify the strategy `'mean'`, `'median'`, `'most_frequent'` (or `missing_value`?) and let the clf construct the `Imputer` on the fly…
   - Pass a built `Imputer` object (like we do for `scoring` or `cv`).

   This was the simplest approach. Variants are 5 and 6. Note that we can do this already using a pipeline of imputer followed by the random forest (a minimal pipeline sketch is given after this list).
2. (-1 +1) Ignore the missing values at the time of generating the splits.
3. (+1 +1) Find the best split by sending the missing-valued samples to either side and choosing the direction that brings about the maximum reduction in entropy (impurity). This is Gilles' suggestion. It is conceptually the same as the "separate-class method", where the missing values are considered as a separate categorical value. Ding and Simonoff's paper considers this to be the best method in a variety of situations. (An illustrative sketch of the direction-choosing step also follows the list.)
4. As done in `rpart`, we could use surrogate variables, where the strategy is basically to use the other features to decide the split if one feature goes missing…
5. Probabilistic method where the missing values are sent to both children, but are weighted in proportion to the number of non-missing values in each split. Ref: Ding and Simonoff's paper. I think this goes something along the lines of this example, with `X = [1, 2, 3, nan, nan, nan, nan]` and `y = [0, 0, 1, 1, 1, 1, 0]`:
   - Split with the available values: L --> [1, 2], R --> [3]
   - Weights for the last 4 missing-valued samples: L --> 2/3, R --> 1/3
6. Do imputation, treating it as a supervised learning problem in itself, as done in MissForest: build a model using the available data --> predict the missing values using this built model.
7. Impute the missing values using an inaccurate estimate (say, the median imputation strategy). Build a RF on the completed data and update the missing values of each sample by the weighted mean value using proximity-based methods. Repeat this until convergence. (Refer to Gilles' PhD, Section 4.4.4.)
8. Similar to 6, but a one-step method where the imputation is done using the median of the k nearest neighbors. Refer to this airbnb blog.
9. Use ternary trees instead of binary trees, with one branch dedicated to missing values? (Refer to Gilles' PhD, Section 4.4.4.) This, I think, is conceptually similar to 4.
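As mentioned in option 1, the pipeline workaround is already available. A minimal sketch, using `SimpleImputer` (the current replacement for `preprocessing.Imputer`) and a made-up toy dataset:

```python
# Option 1 workaround today: impute, then fit the forest, via a pipeline.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
y = np.array([0, 0, 1, 1])

clf = make_pipeline(
    SimpleImputer(strategy="median"),          # 'mean' / 'most_frequent' also possible
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(X, y)
print(clf.predict([[np.nan, 5.0]]))
```

And a rough illustrative sketch of the direction-choosing step from option 3. This is not scikit-learn's internal implementation; the Gini helper, threshold and toy data below are just to make the idea concrete:

```python
# For a fixed threshold, try routing the missing-valued samples to the left
# child and to the right child, and keep whichever direction gives the lower
# size-weighted Gini impurity.
import numpy as np

def gini(y):
    """Gini impurity of a label vector (an empty slice counts as pure)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_child_impurity(y_left, y_right):
    """Size-weighted average impurity of the two children."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n

def best_direction_for_missing(x, y, threshold):
    """Decide whether samples whose x is missing should go left or right."""
    missing = np.isnan(x)
    goes_left = np.zeros(len(x), dtype=bool)
    goes_left[~missing] = x[~missing] <= threshold
    goes_right = ~missing & ~goes_left

    impurity_if_left = weighted_child_impurity(y[goes_left | missing], y[goes_right])
    impurity_if_right = weighted_child_impurity(y[goes_left], y[goes_right | missing])
    return "left" if impurity_if_left <= impurity_if_right else "right"

# Toy data reusing the example from option 5 above.
x = np.array([1.0, 2.0, 3.0, np.nan, np.nan, np.nan, np.nan])
y = np.array([0, 0, 1, 1, 1, 1, 0])
print(best_direction_for_missing(x, y, threshold=2.5))  # -> "right" here
```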
NOTE:

- 4, 7, 8 and 9 are computationally intensive.
- 5 is not easy to do with our current API.
- 3 and 6 seem promising. I will implement 3 and see if I can extend that to 6 later.
- Gilles' -1s were for 1 and 2 (the rest were added later).
- Ding and Simonoff's paper, which compares various methods and their relative accuracy, is a good reference.
(Table taken from Ding and Simonoff's paper: the performance of various missing-value methods.)
About this issue
- Original URL
- State: closed
- Created 9 years ago
- Reactions: 16
- Comments: 51 (37 by maintainers)
Agree with above. All the boosting libraries have the ability to handle nulls natively. This should be a feature in sklearn for 2021. There should be at least the option to handle them.
I opened a PR https://github.com/scikit-learn/scikit-learn/pull/23595 as the first step to getting missing-value support in random forests. The PR adds missing-value support for decision trees, restricted to `splitter="best"` and dense data to keep the PR smaller and easier to review. Once that PR is merged, follow-up PRs will add missing-value support for the random splitter and sparse data, and ultimately enable it for random forests.

Closing this issue since the feature has been implemented.
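For reference, a minimal sketch of what that first step enables, assuming a scikit-learn release that already includes the PR above: a plain decision tree fit directly on data containing NaNs (toy data made up for illustration).

```python
# Sketch assuming a release with missing-value support for DecisionTreeClassifier
# (splitter="best", dense input). NaNs are accepted directly by fit/predict.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [np.nan], [2.0], [np.nan]])
y = np.array([0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict(np.array([[np.nan]])))
```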
Currently our boosting estimators, `HistGradientBoostingRegressor` and `HistGradientBoostingClassifier`, can handle missing values natively. This is the approach used by XGBoost.
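For example (a minimal sketch with made-up toy data), these estimators accept NaNs directly, with no imputation step:

```python
# HistGradientBoostingClassifier handles NaN natively: during training, samples
# with missing values are routed to whichever child yields the better gain.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[1.0], [2.0], [np.nan], [3.0], [np.nan], [4.0]])
y = np.array([0, 0, 1, 1, 1, 1])

clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(clf.predict(np.array([[np.nan]])))
```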
In terms of theoretical analysis, you can find one in Section 5 of https://arxiv.org/abs/1902.06931, and Section 6 empirically compares multiple split strategies.
Another, more extensive, empirical study of missing values shows the benefit of handling them in trees: https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac013/6568998
Does anyone know what the status of this is?
Agree with folks above, this is a critical feature!
I think a boolean mask would have to be the way to go. Testing at the Python level seems simpler than trying to test at the Cython level. I'm concerned this will bog down the tree implementation, though.
@lesshaste Thanks a lot for the links… I’ll look into it 😃