xgboost: Fast histogram algorithm exhibits long start-up time and high memory usage

Two issues in one:

  • Fast histogram xgboost does not start training even after a long wait (stuck building the matrix?)
  • Fast histogram xgboost uses a massive amount of memory (over 50GB RAM, while exact xgboost peaks at only about 10GB RAM)

Before running, things to know:

  • Will not work with Microsoft R versions (incompatible because Microsoft R spams messages about sending data collection to Microsoft while compiling packages)
  • The dataset download is 240MB, 2GB uncompressed; the RDS files total 650MB
  • Make sure to have about 25GB of RAM free for fast histogram (an extra 25GB seems to be committed in memory but never used, for unknown reasons)
  • Creating the dataset (3GB, 7,744,201,107,060 elements) requires my svmlight parser package in R
  • My scripts can be run copy & paste by just changing the folders appropriately in setwd
  • I expected fast histogram to be slower than exact xgboost due to the sparsity, but not by this much

Created files info:

  File: url_svmlight.tar.gz
CRC-32: c152f632
   MD4: b05d0a58ad6f53f9ad20a5f350c25df2
   MD5: 2eb74f59d06807a3845f90ed36fe8ded
 SHA-1: 605121dde12ce7baf098e30cd8856ef4ed6d5e69
  File: reput_sparse.rds
CRC-32: 846b1907
   MD4: 2ea36b208118b7d46d18ccb5fad8da98
   MD5: ce371f418463ebc799367ed41f5fd654
 SHA-1: 7f2a73609d72ff61a24a6d176c44752b68954ea6

Environment info

Operating System: Windows Server 2012 R2 (bare metal) - the OS does not matter, it also happens on Ubuntu 17.04

Compiler: MinGW 7.1

Package used (python/R/jvm/C++): R 3.4

xgboost version used: from source, latest commit of today (e5e7217)

Computer specs:

  • Quad E7-88xx v4 (72 cores), but 3 of the 4 CPUs were disabled to get rid of NUMA
  • 1TB of 1866MHz RAM (256GB available after disabling the other NUMA nodes)
  • Not virtualized

Used 8 threads for xgboost because it was for a benchmark comparison.

Steps to reproduce

  1. Download this dataset: http://archive.ics.uci.edu/ml/datasets/URL+Reputation (svmlight format, 121 parts, 2396130 rows, 3231961 features)
  2. Load data using my svmlight parser
  3. Train an xgboost model with the fast histogram method (the default is 255 bins; one can start directly with 15 bins)

What have you tried?

  1. Changing xgboost to exact: starts training immediately
  2. Lowering the number of threads to 4: still does not start after 30 minutes
  3. Lowering the number of bins to 15: still does not start after 30 minutes
  4. Lowering the number of bins and setting the number of threads to 1 (on a small laptop with the same environment): eventually starts training after 5 minutes

Scripts:

LIBRARY DOWNLOAD:

install.packages("devtools")
install.packages("Matrix")
install.packages("Rcpp")
install.packages("RcppEigen")
devtools::install_github("Laurae2/sparsity")
# Install xgboost separately from source
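
For the last step, one possible route is installing the R package from a local source tree; treat the following as a rough sketch only (it assumes a recursive clone of https://github.com/dmlc/xgboost at the commit noted above, a working MinGW 7.1 toolchain available to R, and the local path is just an example):

# Sketch: install the xgboost R package from a local clone of the repository.
# Assumes the repository was cloned with:
#   git clone --recursive https://github.com/dmlc/xgboost
install.packages("xgboost/R-package", repos = NULL, type = "source")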

DATA LOAD:

# Libraries
library(sparsity)
library(Matrix)

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

# Read Day0 and pad it to the full feature space
data <- read.svmlight(paste0("Day0.svm"))
data$matrix@Dim[2] <- 3231962L  # force the column count
data$matrix@p[length(data$matrix@p):3231963] <- data$matrix@p[length(data$matrix@p)]  # pad the column pointers
data$matrix <- data$matrix[1:(data$matrix@Dim[1] - 1), ]  # drop the last row
label <- (data$labels[1:(data$matrix@Dim[1])] + 1) / 2  # map {-1, +1} labels to {0, 1}
data <- data$matrix

new_data <- list()

# Read Day1..Day120, pad each file to the same feature space, and rbind in blocks of 10
for (i in 1:120) {
  indexed <- (i %% 10) + (10 * ((i %% 10) == 0))  # slot 1..10 within the current block of 10 files
  new_data[[indexed]] <- read.svmlight(paste0("Day", i, ".svm"))
  new_data[[indexed]]$matrix@Dim[2] <- 3231962L
  new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p):3231963] <- new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p)]
  new_data[[indexed]]$matrix <- new_data[[indexed]]$matrix[1:(new_data[[indexed]]$matrix@Dim[1] - 1), ]
  label <- c(label, (new_data[[indexed]]$labels[1:(new_data[[indexed]]$matrix@Dim[1])] + 1) / 2)
  
  if ((i %% 10) == 0) {
    
    data <- rbind(data,
                  new_data[[1]]$matrix, new_data[[2]]$matrix, new_data[[3]]$matrix,
                  new_data[[4]]$matrix, new_data[[5]]$matrix, new_data[[6]]$matrix,
                  new_data[[7]]$matrix, new_data[[8]]$matrix, new_data[[9]]$matrix,
                  new_data[[10]]$matrix)
    gc(verbose = FALSE)
    
    cat("Parsed element 'Day", i, ".svm'. Sparsity: ", sprintf("%05.0f", as.numeric(data@Dim[1]) * as.numeric(data@Dim[2]) / length(data@i)), ":1. Balance: ", sprintf("%04.02f", length(label) / sum(label)), ":1.\n", sep = "")
    
  }
  
}

# Save to RDS
gc()
saveRDS(data, file = "reput_sparse.rds", compress = TRUE)

# Save labels
saveRDS(label, file = "reput_label.rds", compress = TRUE)
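
Optional sanity check of the saved objects before training (a minimal sketch; the slot names assume the dgCMatrix assembled above):

# Sketch: verify the assembled matrix and labels line up before training
my_check <- readRDS("reput_sparse.rds")
my_label <- readRDS("reput_label.rds")

dim(my_check)                       # rows x columns (columns should be 3231962)
length(my_check@x)                  # number of stored non-zero elements
length(my_label) == nrow(my_check)  # one label per row
mean(my_label)                      # fraction of positive labels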

RUN FAST HISTOGRAM, will take forever(?) to start:

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")

library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)

gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "hist",
                                 grow_policy = "depthwise",
                                 eta = 0.10,
                                 max_bin = 15,
                                 eval_metric = "auc",
                                 debug_verbose = 1),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)
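
To confirm that the time is lost before the first boosting iteration rather than during it, a single-round timing of both tree methods is enough (a sketch; it reuses the data DMatrix and the same parameters as above):

# Sketch: time one boosting round for each tree method using the same DMatrix.
# If the problem is in histogram initialization, nearly all of the "hist"
# elapsed time occurs before iteration [1] is ever printed.
set.seed(11111)
time_hist <- system.time(
  xgb.train(params = list(nthread = 8, max_depth = 5, tree_method = "hist",
                          max_bin = 15, eta = 0.10, eval_metric = "auc"),
            data = data, nrounds = 1)
)
set.seed(11111)
time_exact <- system.time(
  xgb.train(params = list(nthread = 8, max_depth = 5, tree_method = "exact",
                          eta = 0.10, eval_metric = "auc"),
            data = data, nrounds = 1)
)
print(time_hist)
print(time_exact)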

RUN EXACT, will start immediately, and very fast. 10GB peak:

# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")

my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")

library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)

gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "exact",
                                 eta = 0.10,
                                 eval_metric = "auc"),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)
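
R's gc() cannot see allocations made inside the xgboost native library (the DMatrix and histogram index), so peaks like the 50GB above have to be watched at the OS level (Task Manager / htop); a rough R-side check is still possible (a sketch):

# Sketch: reset R's "max used" counters, train, then read them back.
# Memory allocated by the xgboost native library is NOT counted here,
# so OS-level monitoring is still required.
invisible(gc(reset = TRUE))
model <- xgb.train(params = list(nthread = 8, max_depth = 5,
                                 tree_method = "hist", max_bin = 15,
                                 eta = 0.10, eval_metric = "auc"),
                   data = data, nrounds = 10)
gc()  # the "max used" column shows the R-side peak since the reset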

ping @hcho3

Most upvoted comments

@trivialfis This issue is about multiple problems found when using xgboost fast histogram initialization on large datasets with many (millions of) features.

The problems encountered, summarized:

  • fast histogram uses significantly more RAM than exact
  • fast histogram training takes a very long time to start while exact starts immediately (histogram initialization is not expected to take many minutes or hours)
  • the older the CPU generation (Intel), the worse fast histogram scales during histogram initialization (efficiency can even be negative with a low number of threads); see the timing sketch after this list
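
One way to quantify the scaling point (not from the original issue; a sketch that reuses the data DMatrix and parameters from the scripts above) is to time a single hist round at several thread counts:

# Sketch: time one "hist" round at different thread counts.
# Poor scaling shows up as elapsed time that shrinks little, or even grows,
# as threads are added.
for (threads in c(1, 2, 4, 8)) {
  t <- system.time(
    xgb.train(params = list(nthread = threads, max_depth = 5,
                            tree_method = "hist", max_bin = 15,
                            eta = 0.10, eval_metric = "auc"),
              data = data, nrounds = 1)
  )
  cat("nthread =", threads, "- elapsed:", t[["elapsed"]], "seconds\n")
}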

Possible xgboost solutions:

  • rewrite histogram creation; this was already attempted by @hcho3 in #2493 + #2501 (it has some side effects on performance / RAM usage, so the previous histogram initialization remains available via enable_feature_grouping = 0)
  • parallelize the single-threaded parts of histogram creation that bottleneck its speed; attempted by @hcho3 in #2543
  • allow creating the histogram separately and re-using it for training (this would also make it possible to not rebuild it for every training run, as LightGBM allows)

Possible end-user solutions, from highest to lowest priority:

  • use far fewer threads (the older the CPU generation, the lower the maximum number of threads that should be used); see the parameter sketch after this list
  • use fewer features
  • use exact mode instead of fast histogram
  • upgrade to a newer CPU generation for better scalability (this seems to have a major impact)
  • use higher-frequency RAM for additional memory bandwidth and better scalability
  • fill all RAM channels to reach peak memory bandwidth
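
In practice the top workarounds combine into a parameter set like the following (a sketch with placeholder values, reusing the data DMatrix from the scripts above):

# Sketch: apply the highest-priority workarounds - few threads and few bins -
# or fall back to the exact method entirely.
set.seed(11111)
model <- xgb.train(params = list(nthread = 2,           # far fewer threads
                                 max_bin = 15,          # fewer bins
                                 max_depth = 5,
                                 tree_method = "hist",  # or "exact" as a fallback
                                 eta = 0.10,
                                 eval_metric = "auc"),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)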