LightGBM: [r-package] cannot open data file error when reading matrix
When using the Bank Marketing dataset (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing, download: http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip) as outlined below:
library(lightgbm)
d <- read.csv2("path/to/bank.csv")
split_factor = 0.5
n_samples = nrow(d)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
dataTrain <- d[train_idx,]
dataTest <- d[-train_idx,]
dtrain <- lgb.Dataset(data = as.matrix(dataTrain[,1:16]),
label = as.numeric(dataTrain[,"y"]),
free_raw_data=FALSE
)
bst <- lightgbm(data = dtrain,
num_leaves = 4,
learning_rate = 1,
nrounds = 2,
objective = "binary",
verbose= "2"
)
This results in the following error from LightGBM (translated from the German R locale):
Warning: argument should be a character vector of length 1
all but the first element will be ignored
Error in lgb.call("LGBM_DatasetCreateFromFile_R", ret = handle, lgb.c_str(private$raw_data), :
api error: cannot open data file 29
Is this an issue with lightGBM or am I using the package wrong?
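A likely cause, sketched with base R only: `as.matrix()` on a data frame that mixes numeric and factor columns returns a character matrix. The `LGBM_DatasetCreateFromFile_R` call in the error above suggests that the character data was then treated as a file name, so the `29` in the message is presumably the first cell of that matrix. The toy data frame below is an assumption standing in for bank.csv.

```r
# Toy stand-in for bank.csv (hypothetical values): one numeric and one factor column.
toy <- data.frame(
  age = c(30, 33, 35),
  job = c("admin.", "services", "admin."),
  stringsAsFactors = TRUE
)

# as.matrix() coerces every cell to character when any column is non-numeric.
m_chr <- as.matrix(toy)
print(is.character(m_chr))

# data.matrix() replaces factors by their integer level codes,
# so the result is a numeric matrix.
m_num <- data.matrix(toy)
print(is.numeric(m_num))
```

Checking `class()` or `is.numeric()` on the matrix before passing it to `lgb.Dataset` would catch this early.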
Environment info
Operating System: OS X 10.12.2
CPU: 2.5 GHz Intel Core i7
Error Message:
api error: cannot open data file 29
Reproducible examples
see above
Steps to reproduce
- install current master branch, commit id 3db4216a0bd176a57906fdc11a34c77215363733
devtools::install_github("Microsoft/LightGBM", subdir = "R-package")
- follow minimal example above
Edit: updated example
library(lightgbm)
d <- read.csv2("path/to/bank.csv")
split_factor = 0.5
n_samples = nrow(d)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
numericd <- d
numericd$job <- as.numeric(numericd$job)
numericd$marital <- as.numeric(numericd$marital)
numericd$education <- as.numeric(numericd$education)
numericd$default <- as.numeric(numericd$default)
numericd$housing <- as.numeric(numericd$housing)
numericd$loan <- as.numeric(numericd$loan)
numericd$contact <- as.numeric(numericd$contact)
numericd$month <- as.numeric(numericd$month)
numericd$poutcome <- as.numeric(numericd$poutcome)
numericd$y <- as.numeric(numericd$y)
dataTrain <- numericd[train_idx,]
dataTest <- numericd[-train_idx,]
dataTrainX <- as.matrix(dataTrain[,1:16])
dataTrainLabel <- dataTrain[,"y"]
class(dataTrainX) # matrix
class(dataTrainLabel) #numeric
dtrain <- lgb.Dataset(data = dataTrainX,
label = dataTrainLabel,
free_raw_data=FALSE,
is_sparse=FALSE,
#colnames=TRUE,
colnames= colnames(dataTrainX),
categorical_feature = c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome")
)
bst <- lightgbm(data = dtrain,
num_leaves = 4,
learning_rate = 1,
nrounds = 2,
objective = "binary",
verbose= "2"
)
About this issue
- State: closed
- Created 7 years ago
- Comments: 27 (5 by maintainers)
@geoHeil Please check #228 for my fix. Thanks for reporting the issue, I think I fixed it.
Install it with:
devtools::install_github("Laurae2/LightGBM", ref = "patch-5", subdir = "R-package")
The PR #228 page has code you can run, but you can also keep your own code; just make sure you use this so that all IDs start at 0:
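A minimal sketch of such a conversion, assuming the categorical columns are factors. The toy data frame and the `cat_cols` list are illustrative assumptions, not the original snippet.

```r
# Toy data frame (hypothetical values) with two factor columns.
toy <- data.frame(
  job = factor(c("admin.", "services", "admin.")),
  marital = factor(c("married", "single", "married")),
  age = c(30, 33, 35)
)

# Hypothetical list of the columns to treat as categorical.
cat_cols <- c("job", "marital")

# as.numeric() on a factor returns 1-based level codes;
# subtracting 1 makes the IDs start at 0.
toy[cat_cols] <- lapply(toy[cat_cols], function(x) as.numeric(x) - 1)

print(toy)
```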
Then you can use categorical_feature as you like, for example:
Check below, but it’s a long story.
@guolinke I think there is an error in how the categorical_feature parameter is handled (I fixed it with https://github.com/Microsoft/LightGBM/pull/228 ). I get this error when I try this:
I put a lot of debug output below; it is just for future debugging.
I use this to load the data in the appropriate format (your code was missing the -1; without it the category IDs do not start at 0, which causes training issues):
Just a quick check on the data:
I saved data to svmlight for comparison later:
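One way to write such a file with base R only is sketched below. The helper `write_svmlight` is hypothetical (it is not a LightGBM function and not the code used in this comment), and the 1-based feature indices are an assumption; adjust them if the consumer expects otherwise.

```r
# Hypothetical helper: write a dense numeric matrix and its labels in svmlight
# format, one row per line as "<label> <index>:<value> ...", omitting zeros.
write_svmlight <- function(x, y, file) {
  stopifnot(is.matrix(x), is.numeric(x), length(y) == nrow(x))
  lines <- vapply(seq_len(nrow(x)), function(i) {
    nz <- which(x[i, ] != 0)                       # positions of non-zero entries
    feats <- paste0(nz, ":", x[i, nz], collapse = " ")
    trimws(paste(y[i], feats))                     # drop trailing space for empty rows
  }, character(1))
  writeLines(lines, file)
  invisible(lines)
}

# Toy data (hypothetical values).
x <- matrix(c(1, 0, 2.5,
              0, 3, 0), nrow = 2, byrow = TRUE)
y <- c(1, 0)
out <- tempfile(fileext = ".svmlight")
svm_lines <- write_svmlight(x, y, out)
print(svm_lines)
```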
Just a quick comparison, models are training, the categorical features are showing up in the dataset correctly:
This will be for comparison with a shell (we will see it is ignoring categorical parameter):
I use this to compare with a shell:
My configuration file lightgbm_01.conf:
lightgbm_bank1.log (without categoricals, this is what R follows even with the categorical parameter set):
lightgbm_bank2.log (with supposed categoricals):
@guolinke For debugging your functions in R, I use this:
You can run this after:
When entering the debug mode, run this:
And I get an error here after tracing down all the calls 1 by 1 manually:
Once I understood how it was working, a simple fix was more than enough. It should also provide a large performance increase for people using thousands or millions of categorical features by avoiding a loop in R.
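To illustrate what avoiding a loop can look like, here is a sketch of a vectorised name-to-index lookup with `match()`. This is not the code from the PR; the shortened column list, the variable names, and the zero-based output are assumptions made for the example.

```r
# Shortened, assumed list of feature column names.
feature_names <- c("age", "job", "marital", "education", "default")

# Hypothetical request that contains one misspelled name ("martial").
categorical_feature <- c("job", "martial", "education")

# Vectorised lookup: match() returns the position of each name, or NA if absent.
idx <- match(categorical_feature, feature_names)

# Collect unknown names so they can be reported instead of silently dropped.
unknown <- categorical_feature[is.na(idx)]

# Zero-based indices of the names that were found (assumed backend convention).
zero_based <- idx[!is.na(idx)] - 1L

print(unknown)
print(zero_based)
```

A single `match()` call replaces one lookup per feature, and the `NA` entries make misspelled column names easy to spot.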