scikit-learn: Fix gradient boosting quantile regression
Describe the workflow you want to enable
The quantile loss function used by GradientBoostingRegressor is too conservative in its predictions for extreme values.
This makes the quantile regression almost equivalent to looking up the dataset's unconditional quantile, which is not very useful.
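For context, the quantile (pinball) loss behind loss="quantile" weights errors asymmetrically; a minimal sketch of it is below (the function name is just illustrative, not scikit-learn's internal implementation):

import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    # under-predictions (y_true > y_pred) are weighted by alpha,
    # over-predictions by (1 - alpha); minimising this targets the alpha-quantile
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, alpha * diff, (alpha - 1) * diff))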
Describe your proposed solution
Use the same type of quantile estimation as in the scikit-garden package (quantile regression forests).
Describe alternatives you’ve considered, if relevant
When the GB regressor overfits, this behavior seems to go away.
Additional context
import pandas as pd
from sklearn.datasets import load_boston  # note: load_boston has since been removed from scikit-learn; any regression dataset works for this example
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor
from skgarden import RandomForestQuantileRegressor
data = load_boston()
X = pd.DataFrame(data=data["data"], columns=data["feature_names"])
y = pd.Series(data=data["target"])
# with sklearn: a separate model has to be fitted for each quantile (alpha)
gb_learn = GradientBoostingRegressor(loss="quantile", n_estimators=20, max_depth=10)
gb_learn.set_params(alpha=0.5)
gb_learn.fit(X, y)
pred_learn_median = gb_learn.predict(X)
gb_learn.set_params(alpha=0.05)
gb_learn.fit(X, y)
pred_learn_m_ci = gb_learn.predict(X)
gb_learn.set_params(alpha=0.95)
gb_learn.fit(X, y)
pred_learn_p_ci = gb_learn.predict(X)
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y, y=pred_learn_median, label="Median")
sns.scatterplot(x=y, y=pred_learn_m_ci, label="5% quantile")
sns.scatterplot(x=y, y=pred_learn_p_ci, label="95% quantile")
plt.plot([0, 50], [0, 50], c="red")
sns.despine()
plt.xlabel("True value")
plt.ylabel("Predicted value")
plt.show()
# with skgarden: a single model, fitted once, can be queried for any quantile
rf_garden = RandomForestQuantileRegressor(n_estimators=20, max_depth=3)
rf_garden.fit(X, y)
pred_garden_median = rf_garden.predict(X, quantile=50)
pred_garden_m_ci = rf_garden.predict(X, quantile=5)
pred_garden_p_ci = rf_garden.predict(X, quantile=95)
fig = plt.figure(figsize=(12, 8))
sns.scatterplot(x=y, y=pred_garden_median, label="Median")
sns.scatterplot(x=y, y=pred_garden_m_ci, label="5% quantile")
sns.scatterplot(x=y, y=pred_garden_p_ci, label="95% quantile")
plt.plot([0, 50], [0, 50], c="red")
sns.despine()
plt.xlabel("True value")
plt.ylabel("Predicted value")
plt.show()
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 22 (9 by maintainers)
Hey @lorentzenchr,
Thanks for your feedback. While I agree that the RF and GB models are not 100% comparable, the GB model overfits more (see the median case, which is closer to y = x), so it should also be doing better on the quantiles, which is not the case.
I still think the skgarden approach has 2 benefits:
- Because the splits are made to minimise MSE, you can reuse the same model (without retraining) for all quantiles. It relies on the distribution of the training samples that share the same leaves to estimate the quantiles for a particular prediction (see the sketch after this list). I am not sure this approach would work with a gradient boosting model though, because the distributions in each leaf are not independent.
- Objectively, it seems pretty clear that the skgarden model is working much better than the sklearn model. For example, consider a point where the true value (y_true) is 15. The sklearn model predicts that the median is around 15.0, which is great, but then predicts that the 5% quantile is around 13.0, which seems a bit too close, and the 95% quantile is around 30.0, which seems way too far. The skgarden model makes much more sensible predictions in this case (5%: ~10.0, 50%: ~15.0, 95%: ~20.0). Moreover, it seems extremely dodgy that the sklearn model thinks that when its estimated median is 30.0 its 95% quantile is 30.0, but when its estimated median is 10.0 its 95% quantile is still 30.0.
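To make the leaf-based estimation concrete, here is a simplified sketch of the quantile-regression-forest idea using plain scikit-learn; it is not scikit-garden's exact implementation (which weights samples per leaf and per tree), and forest_quantile_predict is a hypothetical helper, but it shows how one MSE-fitted forest can serve any quantile at prediction time:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_quantile_predict(forest, X_train, y_train, X_query, quantile):
    # estimate a conditional quantile by pooling the training targets that
    # land in the same leaves as each query point, across all trees
    train_leaves = forest.apply(X_train)   # shape (n_train, n_trees)
    query_leaves = forest.apply(X_query)   # shape (n_query, n_trees)
    y_train = np.asarray(y_train)
    preds = np.empty(len(X_query))
    for i, leaves in enumerate(query_leaves):
        # every training sample that shares at least one leaf with this query point
        mask = (train_leaves == leaves).any(axis=1)
        preds[i] = np.quantile(y_train[mask], quantile)
    return preds

rf = RandomForestRegressor(n_estimators=20, max_depth=3).fit(X, y)
q05 = forest_quantile_predict(rf, X, y, X, 0.05)
q50 = forest_quantile_predict(rf, X, y, X, 0.50)
q95 = forest_quantile_predict(rf, X, y, X, 0.95)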
Ignoring the debate about which loss makes more sense from a theoretical standpoint, this second point makes the sklearn quantile regression model unusable in any practical application.
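A rough way to quantify this, assuming y, pred_learn_m_ci and pred_learn_p_ci from the script above are still in scope, is the empirical coverage of the predicted quantiles on the training data; for a calibrated model these fractions should be close to 0.05 and 0.95:

import numpy as np

coverage_05 = np.mean(y <= pred_learn_m_ci)  # fraction of targets below the predicted 5% quantile
coverage_95 = np.mean(y <= pred_learn_p_ci)  # fraction of targets below the predicted 95% quantile
print(f"coverage at 5%: {coverage_05:.2f}, coverage at 95%: {coverage_95:.2f}")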
I think this issue should not be dismissed so quickly and without any debate.