scikit-learn: Inconsistent behaviour of decision trees with equal weights of samples
It seems to me that the structure of the trees shouldn't change after multiplying the weights of all samples by a constant, but I get different results.
Example:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [1, 2], [1, 1], [0, 3]]
y = [False, False, False, True]

for w in [1., 0.1, 0.01, 0.001, 0.0001]:
    # The same constant weight is applied to both classes, so the
    # learned tree structure should not depend on w.
    model = DecisionTreeClassifier(random_state=0, class_weight={False: w, True: w})
    model.fit(X, y)
    print(w, model.tree_.threshold, model.tree_.feature)
```
Results:

```
1.0    [ 2.5 -2.  -2. ] [ 1 -2 -2]
0.1    [ 2.5 -2.  -2. ] [ 1 -2 -2]
0.01   [ 0.5 -2.  -2. ] [ 0 -2 -2]
0.001  [ 2.5 -2.  -2. ] [ 1 -2 -2]
0.0001 [ 2.5 -2.  -2. ] [ 1 -2 -2]
```
The same problem with `w = 0.01` also occurs with `DecisionTreeRegressor` when the weights are passed through the `fit` method, as in the sketch below.
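A hedged sketch of that regressor variant; the regression targets `y` here are my own hypothetical choice (mirroring the boolean labels above), so this exact reproduction is not guaranteed:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = [[1, 0], [1, 2], [1, 1], [0, 3]]
y = [0.0, 0.0, 0.0, 1.0]  # hypothetical targets mirroring the labels above

for w in [1.0, 0.01]:
    model = DecisionTreeRegressor(random_state=0)
    # Uniformly scaled weights passed through fit() instead of class_weight.
    model.fit(X, y, sample_weight=np.full(len(X), w))
    print(w, model.tree_.threshold, model.tree_.feature)
```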
Interestingly, the problem disappears in this example with `np.float32(w)` instead of `w`, but it still appears in other situations.
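For completeness, a sketch of that `np.float32` workaround applied to the original example; per the observation above, the `w = 0.01` anomaly disappears here, though not in general:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [1, 2], [1, 1], [0, 3]]
y = [False, False, False, True]

for w in [1., 0.1, 0.01, 0.001, 0.0001]:
    # Casting the class weights to float32 before fitting avoids the
    # inconsistency in this particular example.
    model = DecisionTreeClassifier(
        random_state=0,
        class_weight={False: np.float32(w), True: np.float32(w)},
    )
    model.fit(X, y)
    print(w, model.tree_.threshold, model.tree_.feature)
```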
Versions: Scikit-Learn 0.17.1, NumPy 1.11.1, SciPy 0.17.1, Python 3.5.2, Windows 10.
About this issue
- State: closed
- Created 8 years ago
- Comments: 15 (13 by maintainers)
So it seems like the issue is here, in the calculation of `gini_left` and `gini_right`. The formula is `1.0 - sq_count_left / (self.weighted_n_left * self.weighted_n_left)`. If `sq_count_left / (self.weighted_n_left * self.weighted_n_left)` should equal 1.0, then interestingly `1.0 - sq_count_left / (self.weighted_n_left * self.weighted_n_left)` is not 0, but something very close to it. This is because, while the ratio should be exactly one, floating-point errors in the accumulated sums make it merely very, very close to 1.0. I had this issue when implementing MAE, and I got rid of it by factoring out the division (it seems `+=` accumulation and division don't play well together). I'll keep poking around for a fix.
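To make the mechanism concrete, here is a pure-Python sketch (not scikit-learn's actual Cython criterion code; `gini_like` is a hypothetical stand-in) that mimics the accumulation pattern described above: per-class weighted counts and the weighted total are built up sample by sample with `+=`, and the impurity is then computed with a division. Because the rounding in those accumulations depends on the magnitude of `w`, the computed impurity can differ in its last bits across scalings, which is enough to flip a tie between two candidate splits:

```python
def gini_like(counts, w):
    """Mimic the accumulation pattern: per-class weighted counts and the
    weighted total are accumulated with `+=`, then the Gini impurity is
    1 - sum_k(count_k^2) / total^2."""
    sums = [0.0] * len(counts)
    total = 0.0
    for k, n in enumerate(counts):
        for _ in range(n):
            sums[k] += w   # per-class weighted count
            total += w     # weighted total for the node
    sq_count = sum(s * s for s in sums)
    return 1.0 - sq_count / (total * total)

# A node with 2 samples of one class and 1 of the other; in exact
# arithmetic the impurity is 1 - 5/9 = 4/9 for every positive w.
for w in [1.0, 0.1, 0.01, 0.001]:
    print(w, repr(gini_like([2, 1], w)))
```

Running this typically prints impurities that agree to many digits but differ in the last few bits across the different scalings of `w`, which is consistent with the split flipping only at `w = 0.01` in the example above.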