xgboost: Change default eval_metric for binary:logistic objective + add warning for missing eval_metric when early stopping is enabled

I stumbled over the default metric of the binary:logistic objective. It seems to be the classification error rate (i.e. 1 - accuracy), which is a rather unfortunate choice: accuracy is not even a proper scoring rule, see e.g. Wikipedia.

In my view, it should be “logloss”, which is a strictly proper scoring rule for estimating the expectation (here: the event probability) under the binary objective.

Edit:

The problem occurs when early stopping is used without manually setting the eval_metric. The default evaluation metric should at least be a strictly consistent scoring rule.

I am using R with XGBoost version 1.1.1.1.
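
To make the point concrete, here is a minimal sketch (my own toy example, not from the issue) comparing the two metrics on hand-picked probabilities: the error rate only checks which side of 0.5 a prediction falls on, while logloss rewards probabilities that are closer to the truth.

y <- c(1, 1, 0, 0)
p_sharp <- c(0.9, 0.8, 0.1, 0.2)  # confident, well-calibrated forecasts
p_blunt <- c(0.6, 0.6, 0.4, 0.4)  # barely on the right side of 0.5

error_rate <- function(y, p) mean((p > 0.5) != y)
logloss    <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

error_rate(y, p_sharp)  # 0
error_rate(y, p_blunt)  # 0  -- identical: no signal for early stopping
logloss(y, p_sharp)     # ~0.16
logloss(y, p_blunt)     # ~0.51 -- logloss still separates the two forecasts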

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 11
  • Comments: 17 (7 by maintainers)

Most upvoted comments

I perfectly agree that changing this default is potentially “breaking”. Still, it might be worth considering.

Here is a famous example: the Titanic dataset.

library(titanic)
library(xgboost)

head(titanic_train)

y <- "Survived"
x <- c("Pclass", "Sex", "SibSp", "Parch", "Fare")

dtrain <- xgb.DMatrix(data.matrix(titanic_train[, x]), label = titanic_train[[y]])


params <- list(objective = "binary:logistic",
               eval_metric = "logloss",  # set explicitly; not the default
               learning_rate = 0.1)

fit <- xgb.cv(params = params, 
              data = dtrain,
              nrounds = 1000, # selected by early stopping
              nfold = 5, 
              verbose = FALSE,
              stratified = TRUE,
              early_stopping_rounds = 1)

fit

There is some training; we stop after 25 rounds.

With the default metric, there is no training and the algorithm stops after the first round…
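
For comparison, this is the same call with the eval_metric entry removed, so xgb.cv falls back to the default error metric (a sketch of the failing case described above; the names params_default and fit_default are mine):

params_default <- list(objective = "binary:logistic",
                       learning_rate = 0.1)  # no eval_metric -> default error metric

fit_default <- xgb.cv(params = params_default,
                      data = dtrain,
                      nrounds = 1000,
                      nfold = 5,
                      verbose = FALSE,
                      stratified = TRUE,
                      early_stopping_rounds = 1)

fit_default  # early stopping triggers after the first round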

Do you think the binary logistic case is the only one where the default metric is inconsistent with the objective?

To new contributors: If you’re reading this and interested in contributing this feature, please comment here. Feel free to ping me with questions.

@mayer79 Yes, let’s change the default for multiclass classification as well.
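
The multiclass case is analogous (merror as the default, mlogloss as its proper-scoring-rule counterpart). A minimal sketch of requesting mlogloss explicitly, using the built-in iris data purely for illustration (not from the issue):

library(xgboost)

y_iris <- as.integer(iris$Species) - 1L  # labels must be 0-based integers
d_iris <- xgb.DMatrix(data.matrix(iris[, 1:4]), label = y_iris)

params_mc <- list(objective = "multi:softprob",
                  num_class = 3,
                  eval_metric = "mlogloss",  # explicit; otherwise merror is used
                  learning_rate = 0.1)

fit_mc <- xgb.cv(params = params_mc,
                 data = d_iris,
                 nrounds = 100,
                 nfold = 5,
                 verbose = FALSE,
                 early_stopping_rounds = 5)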

@mayer79 @lorentzenchr Thanks to the recent discussion, I changed my mind. Let us change the default metric, with clear documentation as well as a run-time warning.

@jameslamb Nice. Yes, let’s throw a warning for a missing eval_metric when early stopping is used. With the warning, the case I mentioned (reproducibility) is also covered, and we can change the default metric.

If we were to change the default, how should we make the transition as painless as possible?

I think you can use missing() to check if eval_metric was not passed, and do something like this:

if (missing(eval_metric)) {
  warning("Using early stopping without specifying an eval metric. ",
          "In XGBoost 1.3.0, the default metric used for early stopping was changed ",
          "from 'error' to 'logloss'. To suppress this warning, explicitly provide an eval_metric.")
}
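
On the user side, keeping the old behaviour and silencing the warning would then only require naming the metric explicitly, e.g. (sketch):

params <- list(objective = "binary:logistic",
               eval_metric = "error")  # or "logloss" to opt into the new default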

Does LightGBM use logloss for the L2 regression objective?

In LightGBM, if you use objective = "regression" and don’t provide a metric, L2 is used both as the objective and as the evaluation metric for early stopping.

For example, you can test this with {lightgbm} 3.0.0 in R:

library(lightgbm)

data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

bst <- lgb.train(
    params = list(
        objective = "regression"  # no metric given; l2 is used for early stopping
        , early_stopping_rounds = 5L
    )
    , data = lgb.Dataset(train$data, label = train$label)
    , valids = list(
        "valid1" = lgb.Dataset(
            test$data
            , label = test$label
        )
    )
    , nrounds = 10L
)

[Screenshot of the lgb.train log: the l2 metric on valid1 is used for early stopping.]

@mayer79 How common do you think it is to use early stopping without explicitly specifying the evaluation metric?

I’ve been thinking through this. I think it is ok to change the default to logloss in the next minor release (1.3.x).

I think it is ok for the same training code given the same data to produce a different model between minor releases (1.2.x to 1.3.x).

That won’t cause anyone’s code to raise an exception, and it won’t affect loading previously-trained models from older versions. Any retraining code should be evaluating the new model on a validation set with a fixed metric anyway.

As long as the changelog in the release makes it clear that this default was changed and that it only affects the case where early stopping is used, I don’t think it will cause problems.