xgboost: Fast histogram algorithm exhibits long start-up time and high memory usage
Two issues in one:
- Fast histogram xgboost does not start training even after a long wait (stuck building the matrix?)
- Fast histogram xgboost uses a massive amount of memory (over 50GB RAM, while exact xgboost peaks at only about 10GB RAM)
Before running, be aware:
- Will not work with Microsoft R versions (incompatible because Microsoft R spams messages about sending data collection to Microsoft while compiling packages)
- Dataset download is 240MB, uncompressed it is 2GB, and the RDS files total 650MB
- Make sure to have about 25GB RAM free for fast histogram (an extra 25GB seems to be committed in memory but never used, for unknown reasons)
- Creating the dataset (3GB, 7,744,201,107,060 elements) requires using my svmlight package in R
- My scripts can be run copy&paste, just change the folders appropriately in setwd
- I expected fast histogram to be slower than exact xgboost due to sparsity, but not by that much
Checksums of the created files:
File: url_svmlight.tar.gz
CRC-32: c152f632
MD4: b05d0a58ad6f53f9ad20a5f350c25df2
MD5: 2eb74f59d06807a3845f90ed36fe8ded
SHA-1: 605121dde12ce7baf098e30cd8856ef4ed6d5e69
File: reput_sparse.rds
CRC-32: 846b1907
MD4: 2ea36b208118b7d46d18ccb5fad8da98
MD5: ce371f418463ebc799367ed41f5fd654
SHA-1: 7f2a73609d72ff61a24a6d176c44752b68954ea6
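The MD5 digests can be verified from R with the built-in tools package. A minimal sketch, assuming it is run from the directory holding the files:
library(tools)
# Compare computed MD5 digests against the values listed above
stopifnot(md5sum("url_svmlight.tar.gz") == "2eb74f59d06807a3845f90ed36fe8ded")
stopifnot(md5sum("reput_sparse.rds") == "ce371f418463ebc799367ed41f5fd654")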
Environment info
Operating System: Windows Server 2012 R2 (baremetal) - the OS doesn't matter; it also happens on Ubuntu 17.04
Compiler: MinGW 7.1
Package used (python/R/jvm/C++): R 3.4
xgboost version used: from source, latest commit of today (e5e7217)
Computer specs:
- Quad E7-88xx v4 (72 cores), but 3 CPUs were disabled to get rid of NUMA
- 1TB 1866MHz RAM (256GB RAM available to get rid of NUMA)
- Not virtualized
Used 8 threads for xgboost because it was for a benchmark comparison.
Steps to reproduce
- Download this dataset: http://archive.ics.uci.edu/ml/datasets/URL+Reputation (svmlight format, 121 parts, 2396130 rows, 3231961 features)
- Load data using my svmlight parser
- Train an xgboost model with the fast histogram method (default is 255 bins; one can start directly with 15 bins)
What have you tried?
- Changing xgboost to exact: starts training immediately
- Lowering the number of threads to 4: still will not start after 30 minutes
- Lowering the number of bins to 15: still will not start after 30 minutes
- Lowering the number of bins and setting the number of threads to 1 (on a small laptop with the same environment): eventually starts training after 5 minutes (see the timing sketch below for isolating where the time goes)
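To isolate where the stall happens, the two stages can be timed separately. A minimal sketch, assuming my_data and label from the scripts below are already loaded; that the start-up cost is paid on the first boosting round is my reading of the behavior, since exact starts immediately on the same DMatrix:
library(xgboost)
# Stage 1: sparse matrix -> DMatrix conversion (returns quickly here)
t_dmatrix <- system.time(dtrain <- xgb.DMatrix(data = my_data, label = label))
# Stage 2: a single "hist" round; the start-up cost shows up here
t_round1 <- system.time(
  xgb.train(params = list(nthread = 8, max_depth = 5,
                          tree_method = "hist", max_bin = 15),
            data = dtrain, nrounds = 1)
)
print(rbind(dmatrix = t_dmatrix, first_round = t_round1))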
Scripts:
LIBRARY DOWNLOAD:
install.packages("devtools")
install.packages("Matrix")
install.packages("Rcpp")
install.packages("RcppEigen")
devtools::install_github("Laurae2/sparsity")
# Install xgboost separately from source
DATA LOAD:
# Libraries
library(sparsity)
library(Matrix)
# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")
# Day0: read, pad the column count to a fixed width, drop the trailing row
data <- read.svmlight("Day0.svm")
data$matrix@Dim[2] <- 3231962L  # 3231961 features + 1 padding column
data$matrix@p[length(data$matrix@p):3231963] <- data$matrix@p[length(data$matrix@p)]  # extend column pointers to the padded width
data$matrix <- data$matrix[1:(data$matrix@Dim[1] - 1), ]  # drop the last row
label <- (data$labels[1:(data$matrix@Dim[1])] + 1) / 2  # map {-1, 1} labels to {0, 1}
data <- data$matrix
new_data <- list()
for (i in 1:120) {
  # Map i into slots 1..10; every 10 files, the slots are bound to `data` at once
  indexed <- (i %% 10) + (10 * ((i %% 10) == 0))
  new_data[[indexed]] <- read.svmlight(paste0("Day", i, ".svm"))
  new_data[[indexed]]$matrix@Dim[2] <- 3231962L
  new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p):3231963] <- new_data[[indexed]]$matrix@p[length(new_data[[indexed]]$matrix@p)]
  new_data[[indexed]]$matrix <- new_data[[indexed]]$matrix[1:(new_data[[indexed]]$matrix@Dim[1] - 1), ]
  label <- c(label, (new_data[[indexed]]$labels[1:(new_data[[indexed]]$matrix@Dim[1])] + 1) / 2)
  if ((i %% 10) == 0) {
    data <- rbind(data, new_data[[1]]$matrix, new_data[[2]]$matrix, new_data[[3]]$matrix, new_data[[4]]$matrix, new_data[[5]]$matrix, new_data[[6]]$matrix, new_data[[7]]$matrix, new_data[[8]]$matrix, new_data[[9]]$matrix, new_data[[10]]$matrix)
    gc(verbose = FALSE)
    cat("Parsed element 'Day", i, ".svm'. Sparsity: ", sprintf("%05.0f", as.numeric(data@Dim[1]) * as.numeric(data@Dim[2]) / length(data@i)), ":1. Balance: ", sprintf("%04.02f", length(label) / sum(label)), ":1.\n", sep = "")
  }
}
# Save to RDS
gc()
saveRDS(data, file = "reput_sparse.rds", compress = TRUE)
# Save labels
saveRDS(label, file = "reput_label.rds", compress = TRUE)
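Before training, a quick sanity check on the assembled objects helps. A sketch; the padded column count comes from the script above:
# The label vector must line up with the matrix, and the column count
# should be the padded 3231962 (3231961 features + 1 padding column)
stopifnot(nrow(data) == length(label))
stopifnot(ncol(data) == 3231962L)
cat("Rows:", nrow(data), "- Columns:", ncol(data), "\n")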
RUN FAST HISTOGRAM, will take forever(?) to start:
# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")
my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")
library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)
gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "hist",
                                 grow_policy = "depthwise",
                                 eta = 0.10,
                                 max_bin = 15,
                                 eval_metric = "auc",
                                 debug_verbose = 1),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)
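Based on the laptop observation above, the only configuration that eventually started was single-threaded with 15 bins. A sketch of that workaround run (expect a multi-minute wait before round 1):
set.seed(11111)
# Single thread + few bins: the one "hist" configuration observed to start
model <- xgb.train(params = list(nthread = 1,
                                 max_depth = 5,
                                 tree_method = "hist",
                                 eta = 0.10,
                                 max_bin = 15,
                                 eval_metric = "auc"),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)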
RUN EXACT, starts immediately and runs very fast (10GB peak):
# SET YOUR WORKING DIRECTORY
setwd("E:/benchmark_lot/data")
my_data <- readRDS("reput_sparse.rds")
label <- readRDS("reput_label.rds")
library(xgboost)
data <- xgb.DMatrix(data = my_data, label = label)
gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "exact",
                                 eta = 0.10,
                                 eval_metric = "auc"),
                   data = data,
                   nrounds = 100,
                   watchlist = list(train = data),
                   verbose = 2,
                   early_stopping_rounds = 50)
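For a direct side-by-side of the two methods, the first round of each can be timed in one harness. A sketch; note that R's gc() cannot see xgboost's native (C++) allocations, so peak memory is best watched from the OS:
# Time round 1 for both tree methods on the same DMatrix
results <- do.call(rbind, lapply(c("exact", "hist"), function(tm) {
  set.seed(11111)
  elapsed <- system.time(
    xgb.train(params = list(nthread = 8, max_depth = 5,
                            tree_method = tm, eta = 0.10),
              data = data, nrounds = 1)
  )[["elapsed"]]
  data.frame(tree_method = tm, seconds_round_1 = elapsed)
}))
print(results)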
ping @hcho3
About this issue
- State: closed
- Created 7 years ago
- Comments: 24 (16 by maintainers)
Commits related to this issue
- Use old parallel algorithm for histogram construction It has been reported that the new parallel algorithm (#2493) results in excessive memory usage (see issue #2326). Until issues are resolved, XGBoost... — committed to hcho3/xgboost by hcho3 7 years ago
- Use old parallel algorithm for histogram construction by default It has been reported that the new parallel algorithm (#2493) results in excessive memory usage (see issue #2326). Until issues are resolv... — committed to hcho3/xgboost by hcho3 7 years ago
- Use old parallel algorithm for histogram construction by default (#2501) It has been reported that the new parallel algorithm (#2493) results in excessive memory usage (see issue #2326). Until issues a... — committed to dmlc/xgboost by hcho3 7 years ago
@trivialfis This issue is about multiple problems found when using xgboost fast histogram initialization on large datasets with many (millions of) features.
The problems encountered summarized:
Possible xgboost solutions:
- enable_feature_grouping = 0
Possible end-user solutions, from top to bottom in priority:
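The fragment above points at the hist updater's feature grouping switch. A sketch of passing it explicitly; the parameter name is taken from the fragment, and whether it is honored depends on the xgboost build:
# Turn off feature grouping in the hist updater (only honored by builds
# that expose this parameter)
set.seed(11111)
model <- xgb.train(params = list(nthread = 8,
                                 max_depth = 5,
                                 tree_method = "hist",
                                 max_bin = 15,
                                 enable_feature_grouping = 0),
                   data = data,
                   nrounds = 100)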