xgboost: Fast histogram algorithm exhibits long start-up time and high memory usage

Two issues in one:

  • Fast histogram xgboost does not start training even after a long wait (stuck building the matrix?)
  • Fast histogram xgboost uses a massive amount of memory (over 50GB RAM, while exact xgboost peaks at only about 10GB RAM)

Before running, things to know:

  • Will not work with Microsoft R versions (incompatible because Microsoft R spams messages about sending data collection to Microsoft while compiling packages)
  • The dataset download is 240MB, 2GB uncompressed; the RDS files total 650MB
  • Make sure to have about 25GB of RAM free for fast histogram (an extra 25GB seems to be committed in memory but never used, for unknown reasons)
  • Creating the dataset (3GB, 7,744,201,107,060 elements) requires my svmlight parser package in R
  • My scripts can be run copy & paste by just changing the folders appropriately in setwd
  • I expected fast histogram to be slower than exact xgboost due to the sparsity, but not by this much

Created files info:

  File: url_svmlight.tar.gz
CRC-32: c152f632
   MD4: b05d0a58ad6f53f9ad20a5f350c25df2
   MD5: 2eb74f59d06807a3845f90ed36fe8ded
 SHA-1: 605121dde12ce7baf098e30cd8856ef4ed6d5e69
  File: reput_sparse.rds
CRC-32: 846b1907
   MD4: 2ea36b208118b7d46d18ccb5fad8da98
   MD5: ce371f418463ebc799367ed41f5fd654
 SHA-1: 7f2a73609d72ff61a24a6d176c44752b68954ea6

Environment info

Operating System: Windows Server 2012 R2 (bare metal) - the OS does not matter, it also happens on Ubuntu 17.04

Compiler: MinGW 7.1

Package used (python/R/jvm/C++): R 3.4

xgboost version used: from source, latest commit of today (e5e7217)

Computer specs:

  • Quad E7-88xx v4 (72 cores), but 3 of the 4 CPUs were disabled to get rid of NUMA
  • 1TB of 1866MHz RAM (256GB available after disabling the other NUMA nodes)
  • Not virtualized

Used 8 threads for xgboost because it was for a benchmark comparison.

Steps to reproduce

  1. Download this dataset: http://archive.ics.uci.edu/ml/datasets/URL+Reputation (svmlight format, 121 parts, 2396130 rows, 3231961 features)
  2. Load data using my svmlight parser
  3. Train an xgboost model with the fast histogram method (the default is 255 bins; one can start directly with 15 bins)

What have you tried?

  1. Changing xgboost to exact: starts training immediately
  2. Lowering the number of threads to 4: still does not start after 30 minutes
  3. Lowering the number of bins to 15: still does not start after 30 minutes
  4. Lowering the number of bins and setting the number of threads to 1 (on a small laptop with the same environment): eventually starts training after 5 minutes

Scripts:

LIBRARY DOWNLOAD:

install.packages("devtools")
install.packages("Matrix")
install.packages("Rcpp")
install.packages("RcppEigen")
devtools::install_github("Laurae2/sparsity")
# Install xgboost separately from source
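
For the last step, one possible route is installing the R package from a local source tree; treat the following as a rough sketch only (it assumes a recursive clone of https://github.com/dmlc/xgboost at the commit noted above, a working MinGW 7.1 toolchain available to R, and the local path is just an example):

# Sketch: install the xgboost R package from a local clone of the repository.
# Assumes the repository was cloned with:
#   git clone --recursive https://github.com/dmlc/xgboost
install.packages("xgboost/R-package", repos = NULL, type = "source")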

DATA LOAD:

# Libraries
library(sparsity)
library(Matrix)

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

# Read Day0 and pad it to the full feature space
data <- read.svmlight(paste0("Day0.svm"))
data$matrix@Dim[2] <- 3231962L  # force the column count
data$matrix@p[length(data$matrix@p):3231963] <- data$matrix@p[length(data$matrix@p)]  # pad the column pointers
data$matrix <- data$matrix[1:(data$matrix@Dim[1] - 1), ]  # drop the last row
label <- (data$labels[1:(data$matrix@Dim[1])] + 1) / 2  # map {-1, +1} labels to {0, 1}
data <- data$matrix

new_data <- list()

# Read Day1..Day120, pad each file to the same feature space, and rbind in blocks of 10
for (i in 1:120) {
  indexed <- (i %% 10) + (10 * ((i %% 10) == 0))  # slot 1..10 within the current block of 10 files
  new_data[[indexed]] <- read.svmlight(paste0("Day", i, ".svm"))
  new_data[[indexed]]$matrix@Dim[2] <- 3231962L
  new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p):3231963] <- new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p)]
  new_data[[indexed]]$matrix <- new_data[[indexed]]$matrix[1:(new_data[[indexed]]$matrix@Dim[1] - 1), ]
  label <- c(label, (new_data[[indexed]]$labels[1:(new_data[[indexed]]$matrix@Dim[1])] + 1) / 2)
  
  if ((i %% 10) == 0) {
    
    data <- rbind(data,
                  new_data[[1]]$matrix, new_data[[2]]$matrix, new_data[[3]]$matrix,
                  new_data[[4]]$matrix, new_data[[5]]$matrix, new_data[[6]]$matrix,
                  new_data[[7]]$matrix, new_data[[8]]$matrix, new_data[[9]]$matrix,
                  new_data[[10]]$matrix)
    gc(verbose = FALSE)
    
    cat("Parsed element 'Day", i, ".svm'. Sparsity: ", sprintf("%05.0f", as.numeric(data@Dim[1]) * as.numeric(data@Dim[2]) / length(data@i)), ":1. Balance: ", sprintf("%04.02f", length(label) / sum(label)), ":1.\n", sep = "")
    
  }
  
}

# Save to RDS
gc()
saveRDS(data, file = "reput_sparse.rds", compress = TRUE)

# Save labels
saveRDS(label, file = "reput_label.rds", compress = TRUE)
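
Optional sanity check of the saved objects before training (a minimal sketch; the slot names assume the dgCMatrix assembled above):

# Sketch: verify the assembled matrix and labels line up before training
my_check <- readRDS("reput_sparse.rds")
my_label <- readRDS("reput_label.rds")

dim(my_check)                       # rows x columns (columns should be 3231962)
length(my_check@x)                  # number of stored non-zero elements
length(my_label) == nrow(my_check)  # one label per row
mean(my_label)                      # fraction of positive labels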

RUN FAST HISTOGRAM, will take forever(?) to start:

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")

library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)

gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "hist",
                                 grow_policy = "depthwise",
                                 eta = 0.10,
                                 max_bin = 15,
                                 eval_metric = "auc",
                                 debug_verbose = 1),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)
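
To confirm that the time is lost before the first boosting iteration rather than during it, a single-round timing of both tree methods is enough (a sketch; it reuses the data DMatrix and the same parameters as above):

# Sketch: time one boosting round for each tree method using the same DMatrix.
# If the problem is in histogram initialization, nearly all of the "hist"
# elapsed time occurs before iteration [1] is ever printed.
set.seed(11111)
time_hist <- system.time(
  xgb.train(params = list(nthread = 8, max_depth = 5, tree_method = "hist",
                          max_bin = 15, eta = 0.10, eval_metric = "auc"),
            data = data, nrounds = 1)
)
set.seed(11111)
time_exact <- system.time(
  xgb.train(params = list(nthread = 8, max_depth = 5, tree_method = "exact",
                          eta = 0.10, eval_metric = "auc"),
            data = data, nrounds = 1)
)
print(time_hist)
print(time_exact)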

RUN EXACT, will start immediately, and very fast. 10GB peak:

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")

library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)

gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "exact",
                                 eta = 0.10,
                                 eval_metric = "auc"),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)
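
R's gc() cannot see allocations made inside the xgboost native library (the DMatrix and histogram index), so peaks like the 50GB above have to be watched at the OS level (Task Manager / htop); a rough R-side check is still possible (a sketch):

# Sketch: reset R's "max used" counters, train, then read them back.
# Memory allocated by the xgboost native library is NOT counted here,
# so OS-level monitoring is still required.
invisible(gc(reset = TRUE))
model <- xgb.train(params = list(nthread = 8, max_depth = 5,
                                 tree_method = "hist", max_bin = 15,
                                 eta = 0.10, eval_metric = "auc"),
                   data = data, nrounds = 10)
gc()  # the "max used" column shows the R-side peak since the reset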

ping @hcho3

Most upvoted comments

@trivialfis This issue is about multiple problems found when using xgboost fast histogram initialization on large datasets with many (millions of) features.

The problems encountered, summarized:

  • fast histogram uses significantly more RAM than exact
  • fast histogram training takes a very long time to start while exact starts immediately (histogram initialization is not expected to take many minutes or hours)
  • the older the CPU generation (Intel), the worse fast histogram scales during histogram initialization (efficiency can even be negative with a low number of threads); see the timing sketch after this list
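
One way to quantify the scaling point (not from the original issue; a sketch that reuses the data DMatrix and parameters from the scripts above) is to time a single hist round at several thread counts:

# Sketch: time one "hist" round at different thread counts.
# Poor scaling shows up as elapsed time that shrinks little, or even grows,
# as threads are added.
for (threads in c(1, 2, 4, 8)) {
  t <- system.time(
    xgb.train(params = list(nthread = threads, max_depth = 5,
                            tree_method = "hist", max_bin = 15,
                            eta = 0.10, eval_metric = "auc"),
              data = data, nrounds = 1)
  )
  cat("nthread =", threads, "- elapsed:", t[["elapsed"]], "seconds\n")
}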

Possible xgboost solutions:

  • rewrite histogram creation; this was already attempted by @hcho3 in #2493 + #2501 (it has some side effects on performance / RAM usage, so the previous histogram initialization remains available via enable_feature_grouping = 0)
  • parallelize the single-threaded parts of histogram creation that bottleneck its speed; attempted by @hcho3 in #2543
  • allow creating the histogram separately and re-using it for training (this would also make it possible to not rebuild it for every training run, as LightGBM allows)

Possible end-user solutions, from highest to lowest priority:

  • use far fewer threads (the older the CPU generation, the lower the maximum number of threads that should be used); see the parameter sketch after this list
  • use fewer features
  • use exact mode instead of fast histogram
  • upgrade to a newer CPU generation for better scalability (this seems to have a major impact)
  • use higher-frequency RAM for additional memory bandwidth and better scalability
  • fill all RAM channels to reach peak memory bandwidth
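
In practice the top workarounds combine into a parameter set like the following (a sketch with placeholder values, reusing the data DMatrix from the scripts above):

# Sketch: apply the highest-priority workarounds - few threads and few bins -
# or fall back to the exact method entirely.
set.seed(11111)
model <- xgb.train(params = list(nthread = 2,           # far fewer threads
                                 max_bin = 15,          # fewer bins
                                 max_depth = 5,
                                 tree_method = "hist",  # or "exact" as a fallback
                                 eta = 0.10,
                                 eval_metric = "auc"),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)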