LightGBM: lightgbm.basic.LightGBMError: Label 72 is not less than the number of label mappings (31)
Hello everyone!
I am new to Python and I am getting this error when running LightGBM on a ranking problem: lightgbm.basic.LightGBMError: Label 72 is not less than the number of label mappings (31)
I tried to search for this error, but could not find many useful resources.
I can't tell where the error occurs. My dataset consists of 4 columns: ["Frequency", "Comments", "Likes", "Nwords"], as seen below.
# 1) Load Dependencies
import pandas as pd
from numpy import where
import matplotlib.pyplot as plt
import numpy as np
from numpy import unique
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
import lightgbm as lgb
gbm = lgb.LGBMRanker()
# 2) Load the Data
# Define Columns
names = ["Frequency", "Comments", "Likes", "Nwords"]
data = pd.read_csv("Posts.csv", encoding="utf-8", sep=";", delimiter=None,
                   names=names, delim_whitespace=False,
                   nrows=181, header=0, engine="python")
X = data.values[:,0:2]
y = data.values[:,3]
# 3) Define the Training Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
query_train = [X_train.shape[0]]
query_val = [X_val.shape[0]]
query_test = [X_test.shape[0]]
# label_values = [label_gain(1,5)]
gbm.fit(X_train, y_train, group=query_train,
        eval_set=[(X_val, y_val)], eval_group=[query_val],  # values=[label_gain(1,5)],
        eval_metric=["ndcg"], eval_at=[1, 2, 3, 4, 5], early_stopping_rounds=10)
# make predictions
test_pred = gbm.predict(X_test)
X_test["predicted_ranking"] = test_pred
X_test.sort_values("predicted_ranking", ascending=False)
Can anyone help me?
Thank you in advance!
Sofia
About this issue
- State: open
- Created 3 years ago
- Comments: 18
@sofiavlachou28 Thanks for your interest in LightGBM!
I wrote up a learning-to-rank example tonight to hopefully answer this and the other issues you’ve opened regarding LGBMRanker in the Python package (#5297, #5283).

label_gain

As described in https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective:
And as described in https://lightgbm.readthedocs.io/en/latest/Parameters.html#label_gain:

group parameter

As described in https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html#lightgbm.LGBMRanker.fit:
This parameter is necessary to tell LightGBM which collections of rows in the training data represent documents from the same “query”. If you aren’t literally working with search engine data (where you have a list of results returned by a single search), you might define “query” as, for example, “all movie ratings created by one user”.
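As a minimal sketch of how group sizes can be built (the user_id column and values here are hypothetical, not from the thread), one group entry per query in order of appearance:

```python
import pandas as pd

# Hypothetical ratings data: each row is one (user, movie) rating,
# and "user_id" plays the role of the "query".
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "feature": [0.2, 0.5, 0.1, 0.9, 0.3, 0.4, 0.8, 0.7, 0.6],
    "rating":  [2, 0, 1, 3, 1, 4, 2, 0, 1],
})

# Rows belonging to the same query must be contiguous, so sort first.
df = df.sort_values("user_id", kind="stable")

# group[i] = number of rows in the i-th query, in order of appearance.
group = df.groupby("user_id", sort=False).size().to_list()
print(group)  # [3, 2, 4] -- the sizes must sum to the number of rows
```

This `group` list is what would be passed as `group=...` to `LGBMRanker.fit()`.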
Sample Code
I created the following example using Python 3.8.12 and lightgbm installed from source from the latest commit on master (https://github.com/microsoft/LightGBM/commit/9489f878b3568e70b441e5df602483e116f24cc6).

The example below uses LightGBM to build a learning-to-rank model to learn how users in the MovieLens-100K dataset rated different movies.

Looking at the shape of these objects may be informative.
The features include some characteristics of the reviewer and some characteristics of the movies.
The target is integer ratings from 0 to 4 (where 0 is very bad and 4 is very good).
And group groups all ratings from one user together as one “query”.

This says “the first 272 rows in X are one query, the next 62 rows are another query, etc.”.

Given data in this format, LGBMRanker can be used to fit a learning-to-rank model.

To check the in-sample fit, you can use something like Spearman correlation, which checks how well the ordering of predicted scores matches the actual ratings.
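As an illustration of that check (my own sketch, not the thread's original code), Spearman correlation is just the Pearson correlation of the ranks:

```python
import numpy as np

def spearman_corr(a, b):
    """Spearman correlation = Pearson correlation of the ranks.

    Ranks are computed with a double argsort; ties are broken by
    position, which is a simplification compared to
    scipy.stats.spearmanr (scipy averages the ranks of tied values).
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

y_true = np.array([0, 1, 2, 3, 4])                # actual ratings
y_pred = np.array([-0.3, 0.1, 0.4, 0.2, 0.9])     # scores, e.g. from gbm.predict(X)
print(spearman_corr(y_true, y_pred))              # ~0.9: one pair is swapped
```

A value near 1.0 means the predicted scores order the items almost exactly like the true ratings do.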
In the Lambdarank application, LightGBM doesn’t give equal weight to all positions in the ranking. For example, it will give higher preference to splits that help it choose correctly between the 1st and 2nd most relevant items than splits that help it choose correctly between the 4th and 5th most relevant movies.
This is where the label_gain parameter comes in. That parameter describes how much more importance LightGBM places on the ordering of different items.

For example, in this dataset with 5 possible ratings, something like the following…
says “correctly ordering the first and second most relevant items is twice as important as correctly ordering the second and third most relevant items”.
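That “twice as important” relationship can be verified numerically. The sketch below (my own illustration) uses the default gain formula 2**i - 1 and shows that each step up in label is worth twice the previous step:

```python
# Default gains in LightGBM's ranking objectives: label_gain[i] = 2**i - 1.
label_gain = [2**i - 1 for i in range(5)]
print(label_gain)  # [0, 1, 3, 7, 15]

# Extra gain from moving an item up by one label step:
diffs = [b - a for a, b in zip(label_gain, label_gain[1:])]
print(diffs)  # [1, 2, 4, 8]

# Each step is exactly twice as valuable as the one before it:
ratios = [b / a for a, b in zip(diffs, diffs[1:])]
print(ratios)  # [2.0, 2.0, 2.0]
```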
I encourage you to try with different values of this parameter, like this:
I hope these examples help! I am going to close and lock #5297 and #5283. If you have other questions about this topic, please ask here.
If you have questions about other LightGBM topics, please open new issues and provide all the information asked for in the issue template.
cc @shiyu1994 @StrikerRUS @ffineis please correct me if anything I’ve said above is imprecise or incorrect
@sofiavlachou28 Thanks for using LightGBM! Ranking objectives in LightGBM use label_gain_ to store the gain of each label value. By default, label_gain_[i] = (1 << i) - 1. So the default label gain only works with a maximum label value of 31. It seems that your dataset contains label values greater than 31. So please specify your customized label_gain, as @guolinke mentioned.

Re-opening this since there are unanswered questions, but I personally would need to do some research before providing an answer.
@sathyarr I think you are correct that label_gain=[1, 1, 1, 1, 1] represents equal importance of all items we are ranking.

About your second topic, I think label_gain refers to the number of ratings (label mappings), not the number of items being rated, based on the error raised in this issue: Label x is not less than the number of label mappings (y). So if you have n distinct ratings in your dataset you should have at least the same number of values in the label_gain list, irrespective of the number of items being rated. I can confirm it works when the number of items is higher than the number of label_gain values.

I think the misunderstanding comes from this:
and it is more correct to say: “correctly labeling items as first or second highest score is twice as important as correctly labeling items as second or third highest score”. This makes sense since the model returns a score, not a direct ordering, and multiple items can have the same value (at least in the training set).

I would be very grateful if @jameslamb or someone with a greater understanding of the model than me could confirm or deny this, thanks!
I had the same problem with label mappings when I tried to use LGBMRanker with Optuna. I made them work well together by following this example. As suggested, I set the label_gain parameter as:
I hope it can also help you!
Best,
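The exact setting is not shown above; a common pattern (my assumption, not necessarily what this commenter used) is to size label_gain from the training labels so that every label value has a mapping:

```python
import numpy as np

# Hypothetical labels; in practice this would be your y_train.
y_train = np.array([0, 3, 7, 72, 12, 5])

# One label_gain entry per possible integer label, 0..max(y_train).
# Linear gains are used here; the default exponential 2**i - 1 grows
# astronomically for labels this large, so linear values are a
# reasonable alternative.
label_gain = list(range(int(y_train.max()) + 1))
print(len(label_gain))  # 73 entries, so label 72 is a valid index

# Then pass it to the model, e.g.:
# gbm = lgb.LGBMRanker(label_gain=label_gain)
```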
Thanks for providing the dataset @sofiavlachou28 .
Looking at the data, I have some observations and a suggestion.
The label for a learning-to-rank problem is expected to be a “relevance score”, explaining how relevant one document is compared to another (see this Stack Overflow answer for a concise explanation).

If you set label_gain to the maximum value in y_train as suggested in https://github.com/microsoft/LightGBM/issues/4808#issuecomment-973991963, model training might run without throwing any errors, but (making some assumptions about your data based on column names) I don’t think the model generated will be what you intended.

It seems you’re using column nwords for the label, which I assume is “number of words in the post”. If you want to use LightGBM to predict the number of words in a document based on how popular it was (likes, comments), I recommend treating that as a regression problem and using LGBMRegressor, not LGBMRanker.

One other suggestion… I noticed the full dataset has only 91 rows, even before holding out some data for validation.
sample code
LightGBM has a few parameters to limit model complexity, whose defaults are set to work well with medium-sized datasets (1000s of observations). If you want LightGBM to learn from 92 observations, consider setting a very small value (like 2) for parameter min_data_in_leaf (link).

Hi @sofiavlachou28 , thanks very much for using LightGBM!
I’d be happy to help you, but we need a little more information.
- Which version of lightgbm are you using and how did you install it?
- Are you able to share your dataset ("Posts.csv"), or to replicate this problem using randomly-created data?