scikit-learn: Inconsistent behaviour of decision trees with equal weights of samples
It seems to me that the structure of the trees shouldn't change after multiplying the weights of all samples by the same constant, but I get different results.
Example:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [1, 2], [1, 1], [0, 3]]
y = [False, False, False, True]
for w in [1., 0.1, 0.01, 0.001, 0.0001]:
    model = DecisionTreeClassifier(random_state=0, class_weight={False: w, True: w})
    model.fit(X, y)
    print(w, model.tree_.threshold, model.tree_.feature)
```
Results:

```
1.0 [ 2.5 -2. -2. ] [ 1 -2 -2]
0.1 [ 2.5 -2. -2. ] [ 1 -2 -2]
0.01 [ 0.5 -2. -2. ] [ 0 -2 -2]
0.001 [ 2.5 -2. -2. ] [ 1 -2 -2]
0.0001 [ 2.5 -2. -2. ] [ 1 -2 -2]
```
The same problem at `w = 0.01`, where the tree splits on feature 0 at threshold 0.5 instead of feature 1 at threshold 2.5, also occurs with `DecisionTreeRegressor` when the weights are passed through the `fit` method, as sketched below.
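A minimal sketch of that variant, assuming a numeric target in place of the boolean labels, with per-sample `sample_weight` values standing in for the `class_weight` dict above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = [[1, 0], [1, 2], [1, 1], [0, 3]]
y = [0.0, 0.0, 0.0, 1.0]

for w in [1., 0.1, 0.01, 0.001, 0.0001]:
    model = DecisionTreeRegressor(random_state=0)
    # the weights are passed per sample through fit rather than via class_weight
    model.fit(X, y, sample_weight=np.full(4, w))
    print(w, model.tree_.threshold, model.tree_.feature)
```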
Interestingly, the problem disappears in the classifier example above when `np.float32(w)` is used instead of `w`, but it still appears in other situations.
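For reference, a sketch of that `np.float32` workaround, reusing `X` and `y` from the first example; per the report, the `w = 0.01` anomaly does not show up here:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [1, 2], [1, 1], [0, 3]]
y = [False, False, False, True]

for w in [1., 0.1, 0.01, 0.001, 0.0001]:
    # cast each weight to float32 before building the class_weight dict
    w32 = np.float32(w)
    model = DecisionTreeClassifier(random_state=0,
                                   class_weight={False: w32, True: w32})
    model.fit(X, y)
    print(w, model.tree_.threshold, model.tree_.feature)
```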
Versions: Scikit-Learn 0.17.1, NumPy 1.11.1, SciPy 0.17.1, Python 3.5.2, Windows 10.
About this issue
- State: closed
- Created 8 years ago
- Comments: 15 (13 by maintainers)
So it seems like the issue is here, in the calculation of `gini_left` and `gini_right`. The formula is `1.0 - sq_count_left / (self.weighted_n_left * self.weighted_n_left)`. If `sq_count_left / (self.weighted_n_left * self.weighted_n_left) = 1.0`, then interestingly `1.0 - sq_count_left / (self.weighted_n_left * self.weighted_n_left) != 0`, but something very close to it. This is because, while `sq_count_left / (self.weighted_n_left * self.weighted_n_left)` should be one, float errors make it merely very, very close to 1.0. I had this issue when implementing MAE, and I got rid of it by factoring out the division (it seems like `+=` and division don't play well together). I'll keep poking around for a fix.