scikit-learn: Inconsistent behaviour of decision trees with equal sample weights

It seems to me that the structure of a fitted tree shouldn't change when the weights of all samples are multiplied by the same constant, but I get different results.

Example

from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [1, 2], [1, 1], [0, 3]]
y = [False, False, False, True]

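# Scaling both class weights by the same constant w should leave the
# learned tree structure unchanged.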
for w in [1., 0.1, 0.01, 0.001, 0.0001]:
    model = DecisionTreeClassifier(random_state=0, class_weight={False: w, True: w})
    model.fit(X, y)
    print(w, model.tree_.threshold, model.tree_.feature)

Results:

1.0 [ 2.5 -2.  -2. ] [ 1 -2 -2]
0.1 [ 2.5 -2.  -2. ] [ 1 -2 -2]
0.01 [ 0.5 -2.  -2. ] [ 0 -2 -2]
0.001 [ 2.5 -2.  -2. ] [ 1 -2 -2]
0.0001 [ 2.5 -2.  -2. ] [ 1 -2 -2]

The same problem with w = 0.01 also occurs with DecisionTreeRegressor when the weights are passed as sample_weight to the fit method.
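For example, a sketch of that variant (same X and y as above, with y cast to float for regression and a constant sample_weight vector standing in for the weights):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

reg = DecisionTreeRegressor(random_state=0)
reg.fit(X, np.asarray(y, dtype=float), sample_weight=np.full(len(X), 0.01))
print(reg.tree_.threshold, reg.tree_.feature)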

Interestingly, in this example the problem disappears when np.float32(w) is passed instead of w, but it still appears in other situations.
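For reference, the float32 variant of the loop above, which picks the same split for every w here:

import numpy as np

for w in [1., 0.1, 0.01, 0.001, 0.0001]:
    cw = np.float32(w)
    model = DecisionTreeClassifier(random_state=0, class_weight={False: cw, True: cw})
    model.fit(X, y)
    print(w, model.tree_.threshold, model.tree_.feature)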

Versions: Scikit-Learn 0.17.1, NumPy 1.11.1, SciPy 0.17.1, Python 3.5.2, Windows 10.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 15 (13 by maintainers)

Most upvoted comments

So it seems like the issue is here, in the calculation of gini_left and gini_right.

The formula is 1.0 - sq_count_left / (self.weighted_n_left * self.weighted_n_left). For a pure node, sq_count_left / (self.weighted_n_left * self.weighted_n_left) should be exactly 1.0, so gini_left should be exactly 0.0. But because of floating-point rounding, the ratio only comes out very, very close to 1.0, so the subtraction leaves a tiny nonzero residue instead of 0. I had the same issue when implementing MAE, and I got rid of it by factoring out the division (it seems += accumulation and division don't play well together). I'll keep poking around for a fix.
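A minimal pure-Python sketch of the effect (the real code is a Cython criterion and the exact sequence of operations differs; the 10 * w line below is just an assumed stand-in for a count that arrives via a different operation order than the += accumulation):

# A pure node: ten samples of one class, each with weight 0.1.
w = 0.1

# Total left weight accumulated sample by sample with +=.
weighted_n_left = 0.0
for _ in range(10):
    weighted_n_left += w            # 0.9999999999999999, not 1.0

# Suppose the per-class weighted count is computed along a different
# path and lands on the neighbouring double instead.
count_left = 10 * w                 # exactly 1.0
sq_count_left = count_left * count_left

gini_left = 1.0 - sq_count_left / (weighted_n_left * weighted_n_left)
print(gini_left)                    # about -2.2e-16 instead of exactly 0.0

A residue like this is enough to flip which candidate split has the lowest impurity, which is how the chosen feature and threshold can end up depending on the scale of the weights.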