scikit-learn: OOB calculation in Random Forests seems to use incorrect indices

update 9 March 2021

After the discussion below and further testing, I have come to the conclusion that this is not a bug after all. The indices are computed correctly, because `_make_estimator` always draws a fresh integer seed and stores it in the `random_state` attribute of each tree – even when a `np.random.RandomState()` object was passed to `_make_estimator`.
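The mechanism described above can be sketched with plain NumPy (this is a simplified mimic, not scikit-learn's actual `_make_estimator` / `_set_random_states` code): the shared generator is consumed once per tree to draw a plain integer seed, and that integer is what each tree stores, so the bootstrap draw is reproducible later.

```python
import numpy as np
from sklearn.utils import check_random_state

# Sketch (not the actual scikit-learn code): the shared RandomState,
# as created in BaseForest.fit, is used once per tree to draw a plain
# integer seed, which each tree then stores in its .random_state.
shared_rng = check_random_state(0)
MAX_INT = np.iinfo(np.int32).max

tree_seeds = [shared_rng.randint(MAX_INT) for _ in range(3)]

# Because each stored seed is a constant int, re-seeding later
# reproduces exactly the bootstrap indices drawn during fit:
idx_fit = np.random.RandomState(tree_seeds[0]).randint(0, 100, 100)
idx_oob = np.random.RandomState(tree_seeds[0]).randint(0, 100, 100)
assert np.array_equal(idx_fit, idx_oob)
```

The key design point is that even though the trees share one generator while being built, none of them stores a reference to it – only the integer it produced.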

Apologies for the false alarm. I got confused by the interchangeable use of random_state as a seed or a random generator (but, interestingly, not as the actual state of the random generator).

original report

I have been looking into the OOB score calculation for random forests, and I am concerned that the indices considered to be “out-of-bag” are not actually the same as those held out from training. A few years ago, PR https://github.com/scikit-learn/scikit-learn/issues/4783 was merged, which replaced the explicit storage of indices with a regeneration procedure.

Specifically, this resulted in the following code fragment in _compute_oob_predictions(self, X, y) in sklearn/ensemble/_forest.py:

for estimator in self.estimators_:
    unsampled_indices = _generate_unsampled_indices(
        estimator.random_state, n_samples, n_samples_bootstrap,
    )

    y_pred = self._get_oob_predictions(
        estimator, X[unsampled_indices, :]
    )
    oob_pred[unsampled_indices, ...] += y_pred
    n_oob_pred[unsampled_indices, :] += 1
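To illustrate what the regeneration relies on, here is a simplified, hypothetical re-implementation of the two private helpers (the real ones live in sklearn/ensemble/_forest.py and additionally handle sample weights). It shows why regeneration is deterministic when `random_state` is an integer seed:

```python
import numpy as np

def generate_sample_indices(random_state, n_samples, n_samples_bootstrap):
    # Simplified mimic of sklearn's private _generate_sample_indices:
    # build a fresh generator from the seed and draw the bootstrap sample.
    rng = np.random.RandomState(random_state)
    return rng.randint(0, n_samples, n_samples_bootstrap)

def generate_unsampled_indices(random_state, n_samples, n_samples_bootstrap):
    # Simplified mimic of _generate_unsampled_indices: redo the bootstrap
    # draw and keep the samples never selected (the out-of-bag set).
    sampled = generate_sample_indices(random_state, n_samples, n_samples_bootstrap)
    counts = np.bincount(sampled, minlength=n_samples)
    return np.arange(n_samples)[counts == 0]

# With an integer seed, regeneration is exact: calling the helper twice
# yields the same OOB indices both times.
oob1 = generate_unsampled_indices(42, 100, 100)
oob2 = generate_unsampled_indices(42, 100, 100)
assert np.array_equal(oob1, oob2)
```

The determinism hinges entirely on `np.random.RandomState(random_state)` receiving the same integer each time – which is exactly what the original report below questions.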

If `estimator.random_state` is a `np.random.RandomState` object rather than a constant seed, this appears to generate new random indices instead of the original set. I noticed that this potential shortcoming was also observed by @amueller in the discussion of https://github.com/scikit-learn/scikit-learn/issues/4783, but no follow-up was included there:

> Silly question: This fails if a np.RandomState was passed to the forest, right? Because then the random_state would have been altered by building the tree? (or I’m tired)
>
> _Originally posted by @amueller in https://github.com/scikit-learn/scikit-learn/issues/4783#issuecomment-236036188_

I think this is a concern more generally, because even if a constant seed is passed to the `RandomForest` initializer, `BaseForest.fit(...)` executes

    random_state = check_random_state(self.random_state)

which turns `random_state` into a `RandomState()` object, followed (a few lines below) by

    trees = [
        self._make_estimator(append=False, random_state=random_state)
        for i in range(n_more_estimators)
    ]

All trees appear to share the same `RandomState()` instance, which is then used to generate indices. Calling `_generate_sample_indices(...)` again as part of `_generate_unsampled_indices(...)` would therefore seem to always generate different indices.
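The suspected failure mode can be reproduced directly with NumPy: a shared `RandomState` instance advances its internal state on every draw, so repeating a call on the same instance does not repeat the indices.

```python
import numpy as np

rng = np.random.RandomState(0)  # shared instance, as after check_random_state

# First draw: the bootstrap indices used while fitting a tree.
fit_indices = rng.randint(0, 100, 100)

# Second draw from the *same* instance: what a naive "regeneration"
# would produce -- the state has advanced, so the indices differ.
regen_indices = rng.randint(0, 100, 100)
assert not np.array_equal(fit_indices, regen_indices)

# Re-seeding from the original integer seed does recover the first draw.
assert np.array_equal(fit_indices, np.random.RandomState(0).randint(0, 100, 100))
```

So if the trees really did store the shared instance, OOB regeneration could not work; the resolution (see the update above) is that they store a freshly drawn integer seed instead.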

Am I missing something, or can this procedure never work correctly?

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 17 (6 by maintainers)

Most upvoted comments

@NicolasHug, I can confirm that the original code, although somewhat obfuscated, does work correctly. See also my update in the top post. You can remove the bug label from this issue, although the proposed modifications (and the discussion in #19648) may still be useful for improving the maintainability of the code.