LightGBM: [r-package] cannot open data file error when reading matrix

When using the UCI Bank Marketing dataset (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing, download: http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip) as outlined below:

library(lightgbm)
d <- read.csv2("path/to/bank.csv")
split_factor = 0.5
n_samples = nrow(d)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
dataTrain <- d[train_idx,]
dataTest  <- d[-train_idx,]

dtrain <- lgb.Dataset(data = as.matrix(dataTrain[,1:16]),
                      label = as.numeric(dataTrain[,"y"]),
                      free_raw_data=FALSE
                      )
bst <- lightgbm(data = dtrain,
                num_leaves = 4,
                learning_rate = 1,
                nrounds = 2,
                objective = "binary",
                verbose= "2"
                )

Results in

error from lightGBM

Warning: argument should be a character string of length 1; only the first element will be used
Error in lgb.call("LGBM_DatasetCreateFromFile_R", ret = handle, lgb.c_str(private$raw_data),  : 
  api error: cannot open data file 29

Is this an issue with lightGBM, or am I using the package incorrectly?
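(A note on the likely mechanism, inferred from the traceback: `as.matrix()` on a data frame that contains any character or factor column coerces the entire matrix to character, and the traceback shows that character data reaching `lgb.c_str()`, i.e. being handled as a file name, hence "cannot open data file". A minimal sketch of the coercion, with illustrative column names:)

```r
# Sketch (illustrative column names): as.matrix() on a mixed-type data
# frame coerces every cell to character, including numeric columns.
d <- data.frame(age = c(30, 41), job = c("admin.", "services"),
                stringsAsFactors = FALSE)
m <- as.matrix(d)
typeof(m)    # "character" -- the numeric 'age' column was coerced too
m[1, "age"]  # "30"
```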

Environment info

Operating System: macOS 10.12.2 CPU: 2.5 GHz Intel Core i7

Error Message:

api error: cannot open data file 29

Reproducible examples

see above

Steps to reproduce

  1. install the current master branch (commit 3db4216a0bd176a57906fdc11a34c77215363733)
  2. devtools::install_github("Microsoft/LightGBM", subdir = "R-package")
  3. follow minimal example above

Edit: updated example

library(lightgbm)
d <- read.csv2("path/to/bank.csv")
split_factor = 0.5
n_samples = nrow(d)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))

numericd <- d
numericd$job <- as.numeric(numericd$job)
numericd$marital <- as.numeric(numericd$marital)
numericd$education <- as.numeric(numericd$education)
numericd$default <- as.numeric(numericd$default)
numericd$housing <- as.numeric(numericd$housing)
numericd$loan <- as.numeric(numericd$loan)
numericd$contact <- as.numeric(numericd$contact)
numericd$month <- as.numeric(numericd$month)
numericd$poutcome <- as.numeric(numericd$poutcome)
numericd$y <- as.numeric(numericd$y)

dataTrain <- numericd[train_idx,]
dataTest  <- numericd[-train_idx,]

dataTrainX <- as.matrix(dataTrain[,1:16])
dataTrainLabel <- dataTrain[,"y"]

class(dataTrainX) # matrix
class(dataTrainLabel) #numeric

dtrain <- lgb.Dataset(data = dataTrainX,
                      label = dataTrainLabel,
                      free_raw_data = FALSE,
                      is_sparse = FALSE,
                      colnames = colnames(dataTrainX),
                      categorical_feature = c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome")
)
bst <- lightgbm(data = dtrain,
                num_leaves = 4,
                learning_rate = 1,
                nrounds = 2,
                objective = "binary",
                verbose = 2
)
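One pitfall worth guarding against in an example like the above is a misspelled categorical name (e.g. "martial" instead of "marital"); a quick check before constructing the Dataset surfaces it immediately. A sketch (the helper name is hypothetical, not part of lightgbm):

```r
# Hypothetical helper: fail fast if a requested categorical column is
# missing from the training matrix (e.g. due to a typo).
check_categoricals <- function(cats, x) {
  missing_cols <- setdiff(cats, colnames(x))
  if (length(missing_cols) > 0) {
    stop("Unknown categorical feature(s): ",
         paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# check_categoricals(c("job", "martial"), dataTrainX) would stop with
# "Unknown categorical feature(s): martial"
```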

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 27 (5 by maintainers)

Most upvoted comments

@geoHeil Please check #228 for my fix. Thanks for reporting the issue, I think I fixed it.

Install it with: devtools::install_github("Laurae2/LightGBM", ref = "patch-5", subdir = "R-package")

The PR #228 page has example code you can run, but you can also keep your own code; just make sure you use the following so that all category IDs start at 0:

numericd$job <- as.numeric(as.factor(numericd$job)) - 1
numericd$marital <- as.numeric(as.factor(numericd$marital)) - 1
numericd$education <- as.numeric(as.factor(numericd$education)) - 1
numericd$default <- as.numeric(as.factor(numericd$default)) - 1
numericd$housing <- as.numeric(as.factor(numericd$housing)) - 1
numericd$loan <- as.numeric(as.factor(numericd$loan)) - 1
numericd$contact <- as.numeric(as.factor(numericd$contact)) - 1
numericd$month <- as.numeric(as.factor(numericd$month)) - 1
numericd$poutcome <- as.numeric(as.factor(numericd$poutcome)) - 1
numericd$y <- as.numeric(as.factor(numericd$y)) - 1

Then specify categorical_feature however you prefer, for example:

categorical_feature = c(1, 2) # categorical features are the 1st and 2nd column
categorical_feature = c("job", "month") # categorical features are columns named "job" and "month"
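The per-column conversions can also be written in one pass. A sketch, assuming the same bank.csv layout (semicolon-separated, with string columns for the categoricals and the label y):

```r
library(lightgbm)

d <- read.csv2("path/to/bank.csv", stringsAsFactors = FALSE)

# Recode every character column (including the label "y") to a 0-based
# integer code in one pass.
cat_cols <- names(d)[vapply(d, is.character, logical(1))]
d[cat_cols] <- lapply(d[cat_cols], function(x) as.numeric(as.factor(x)) - 1)

feature_cols <- setdiff(names(d), "y")
dtrain <- lgb.Dataset(data = as.matrix(d[, feature_cols]),
                      label = d[["y"]],
                      categorical_feature = setdiff(cat_cols, "y"))
```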

How can I make sure categorical_feature is handled correctly?

Check below, but it’s a long story.

@guolinke I think there is an error in how the categorical_feature parameter is handled (I fixed it in https://github.com/Microsoft/LightGBM/pull/228). I get this error when I try the following:

model <- lgb.train(params, temp_train, 2, learning_rate=1, min_data = 1, min_hessian = 1, num_leaves = 1024, categorical_feature = c(1, 2))
Error in data$construct() : 
  'names' attribute [2] must be the same length as the vector [1]
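The failure can be reproduced in isolation with base R alone: `list(vec)` produces a one-element list, so attaching two names to it triggers exactly this error:

```r
cn <- c("job", "month")

# list(...) wraps the whole index vector into a SINGLE list element...
wrapped <- list(seq_along(cn) - 1)
length(wrapped)  # 1

# ...so assigning two names fails with the same message as above:
try(`names<-`(wrapped, cn))
# Error : 'names' attribute [2] must be the same length as the vector [1]
```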

I've put a lot of debug output below; it's just for future debugging reference.

I use this to load the data in the appropriate format (your code didn't subtract 1, and without it the category IDs don't start at 0, which causes training issues):

library(data.table)
library(sparsity)
library(lightgbm)
library(Matrix)

bank <- fread("D:/bank.csv")

numericd <- bank
numericd$job <- as.numeric(as.factor(numericd$job)) - 1
numericd$marital <- as.numeric(as.factor(numericd$marital)) - 1
numericd$education <- as.numeric(as.factor(numericd$education)) - 1
numericd$default <- as.numeric(as.factor(numericd$default)) - 1
numericd$housing <- as.numeric(as.factor(numericd$housing)) - 1
numericd$loan <- as.numeric(as.factor(numericd$loan)) - 1
numericd$contact <- as.numeric(as.factor(numericd$contact)) - 1
numericd$month <- as.numeric(as.factor(numericd$month)) - 1
numericd$poutcome <- as.numeric(as.factor(numericd$poutcome)) - 1
numericd$y <- as.numeric(as.factor(numericd$y)) - 1

A quick check of the data:

for (i in 1:17) {cat(colnames(bank)[i], " (", typeof(bank[[i]]), ") :", length(unique(bank[[i]])), "\n")}
age  ( integer ) : 67 
job  ( character ) : 12 
marital  ( character ) : 3 
education  ( character ) : 4 
default  ( character ) : 2 
balance  ( integer ) : 2353 
housing  ( character ) : 2 
loan  ( character ) : 2 
contact  ( character ) : 3 
day  ( integer ) : 31 
month  ( character ) : 12 
duration  ( integer ) : 875 
campaign  ( integer ) : 32 
pdays  ( integer ) : 292 
previous  ( integer ) : 24 
poutcome  ( character ) : 4 
y  ( character ) : 2 

I saved the data in svmlight format for comparison later:

mini_num <- Matrix(Laurae::DT2mat(numericd[, c(2, 11), with = FALSE]), sparse = TRUE)
write.svmlight(mini_num, numericd$y, "D:/bank.svmlight")

A quick comparison: the models train, and the categorical features show up correctly in the dataset:

temp_train <- lgb.Dataset(mini_num, label=numericd$y, free_raw_data = FALSE, colnames = colnames(numericd)[c(2, 11)])
params <- list(objective="binary", metric="l2")
model <- lgb.cv(params, temp_train, 2, folds = Laurae::kfold(numericd$y, 5), learning_rate=1, min_data = 1, min_hessian = 1, num_leaves = 1024)

[LightGBM] [Info] Number of postive: 407, number of negative: 3210
[LightGBM] [Info] Number of data: 3617, number of features: 2
[LightGBM] [Info] Number of postive: 436, number of negative: 3181
[LightGBM] [Info] Number of data: 3617, number of features: 2
[LightGBM] [Info] Number of postive: 407, number of negative: 3210
[LightGBM] [Info] Number of data: 3617, number of features: 2
[LightGBM] [Info] Number of postive: 403, number of negative: 3214
[LightGBM] [Info] Number of data: 3617, number of features: 2
[LightGBM] [Info] Number of postive: 431, number of negative: 3185
[LightGBM] [Info] Number of data: 3616, number of features: 2
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 124
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 126
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 122
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 121
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 125
[1]:	valid's l2:0.955904+0.0224389 
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 114
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 114
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 110
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 114
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 115
[2]:	valid's l2:1.2283+0.0238211 

temp_train <- lgb.Dataset(mini_num, label=numericd$y, free_raw_data = FALSE)
lgb.Dataset.set.categorical(temp_train, c(1, 2))
temp_train$set_colnames(colnames(numericd)[c(2, 11)])
model <- lgb.cv(params, temp_train, 2, folds = Laurae::kfold(numericd$y, 5), learning_rate=1, min_data = 1, min_hessian = 1, num_leaves = 1024)

<lgb.Dataset>
  Public:
    construct: function () 
    create_valid: function (data, info = list(), ...) 
    dim: function () 
    finalize: function () 
    get_colnames: function () 
    getinfo: function (name) 
    initialize: function (data, params = list(), reference = NULL, colnames = NULL, 
    save_binary: function (fname) 
    set_categorical_feature: function (categorical_feature) 
    set_colnames: function (colnames) 
    set_reference: function (reference) 
    setinfo: function (name, info) 
    slice: function (idxset, ...) 
    update_params: function (params) 
  Private:
    categorical_feature: 1 2
    colnames: job month
    free_raw_data: FALSE
    get_handle: function () 
    handle: NULL
    info: list
    params: list
    predictor: NULL
    raw_data: dgCMatrix
    reference: NULL
    set_predictor: function (predictor) 
    used_indices: NULL

[LightGBM] [Info] Number of postive: 407, number of negative: 3210
[LightGBM] [Info] Number of data: 3617, number of features: 2
[LightGBM] [Info] Number of postive: 436, number of negative: 3181
[LightGBM] [Info] Number of data: 3617, number of features: 2
[LightGBM] [Info] Number of postive: 407, number of negative: 3210
[LightGBM] [Info] Number of data: 3617, number of features: 2
[LightGBM] [Info] Number of postive: 403, number of negative: 3214
[LightGBM] [Info] Number of data: 3617, number of features: 2
[LightGBM] [Info] Number of postive: 431, number of negative: 3185
[LightGBM] [Info] Number of data: 3616, number of features: 2
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 124
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 126
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 122
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 121
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 125
[1]:	valid's l2:0.955904+0.0224389 
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 114
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 114
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 110
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 114
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 115
[2]:	valid's l2:1.2283+0.0238211 

This will serve as a comparison with the shell version (we will see that the categorical_feature parameter is being ignored):

model <- lgb.train(params, temp_train, 2, learning_rate=1, min_data = 1, min_hessian = 1, num_leaves = 1024)

[LightGBM] [Info] Number of postive: 521, number of negative: 4000
[LightGBM] [Info] Number of data: 4521, number of features: 2
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 126
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 116

I use this to compare with a shell:

C:/Compiled/LightGBM/windows/x64/Release/lightgbm config=lightgbm_01.conf data=./bank.svmlight objective=binary 2>&1 | tee lightgbm_bank1.log
C:/Compiled/LightGBM/windows/x64/Release/lightgbm config=lightgbm_01.conf data=./bank.svmlight objective=binary categorical_feature=0,1 2>&1 | tee lightgbm_bank2.log

My configuration file lightgbm_01.conf:

num_iterations = 2
min_data = 1
learning_rate = 1
min_hessian = 1
num_leaves = 1024

lightgbm_bank1.log (without categoricals; this is what R follows even with the categorical parameter set up):

[LightGBM] [Info] Warning: last line of lightgbm_01.conf has no end of line, still using this line
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Finished loading data in 0.027748 seconds
[LightGBM] [Info] Number of postive: 521, number of negative: 4000
[LightGBM] [Info] Number of data: 4521, number of features: 2
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 126
[LightGBM] [Info] 0.004441 seconds elapsed, finished iteration 1
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 116
[LightGBM] [Info] 0.008047 seconds elapsed, finished iteration 2
[LightGBM] [Info] Finished training

lightgbm_bank2.log (with the supposed categoricals):

[LightGBM] [Info] Warning: last line of lightgbm_01.conf has no end of line, still using this line
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Finished loading data in 0.048985 seconds
[LightGBM] [Info] Number of postive: 521, number of negative: 4000
[LightGBM] [Info] Number of data: 4521, number of features: 2
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] No further splits with positive gain, best gain: 0.000000, leaves: 103
[LightGBM] [Info] 0.003052 seconds elapsed, finished iteration 1
[LightGBM] [Info] No further splits with positive gain, best gain: -inf, leaves: 102
[LightGBM] [Info] 0.005716 seconds elapsed, finished iteration 2
[LightGBM] [Info] Finished training

@guolinke For debugging your functions in R, I use this:

debug(lgb.train)

You can run this after:

model <- lgb.train(params, temp_train, 2, learning_rate=1, min_data = 1, min_hessian = 1, num_leaves = 1024, categorical_feature = c(1, 2))

When entering the debug mode, run this:

debug(data$construct)

And after manually tracing down all the calls one by one, I get an error here:

if (!is.null(private$colnames)) {
  # list(...) wraps the whole index vector in a single-element list,
  # so the subsequent names<- fails whenever there is more than one column
  fname_dict <- `names<-`(
    list((seq_along(private$colnames) - 1)),
    private$colnames
  )
}

Once I understood how it worked, a simple fix was enough. It should also provide a large performance increase for people using thousands or millions of categorical features, by avoiding a loop in R.
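The merged change lives in #228 and may differ in detail, but the shape of the fix is roughly this: build the name-to-index dictionary with as.list() instead of list(), so each column gets its own element and no R-level loop is needed:

```r
cn <- c("job", "month", "poutcome")

# as.list() yields one element per column, so the names fit, and the
# whole dictionary is built vectorized (no per-column loop).
fname_dict <- `names<-`(as.list(seq_along(cn) - 1), cn)

fname_dict[["job"]]       # 0
fname_dict[["poutcome"]]  # 2
```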