LightGBM: lightgbm.basic.LightGBMError: Label 72 is not less than the number of label mappings (31)

Hello to everyone!!

I am new to Python and I am getting this error when running LightGBM on a ranking problem: lightgbm.basic.LightGBMError: Label 72 is not less than the number of label mappings (31)

I tried to search for this error, but could not find many useful resources.

I can’t guess where the error occurs. My dataset consists of 4 columns: [“Frequency”, “Comments”, “Likes”, “Nwords”], as seen below.

# 1) Load Dependencies
import pandas as pd
from numpy import where
import matplotlib.pyplot as plt
import numpy as np
from numpy import unique
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
import lightgbm as lgb
gbm = lgb.LGBMRanker()




# 2) Load the Data
# Define Columns
names = ["Frequency", "Comments", "Likes", "Nwords"]

data = pd.read_csv("Posts.csv", encoding="utf-8", sep=";", delimiter=None,
                   names=names, delim_whitespace=False,
                   nrows=181, header=0, engine="python")
X = data.values[:, 0:2]  # features: columns 0-1 (Frequency, Comments); note column 2 (Likes) is not used
y = data.values[:, 3]    # label: Nwords



# 3) Define the Training Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)


query_train = [X_train.shape[0]]
query_val = [X_val.shape[0]]
query_test = [X_test.shape[0]]
# label_values = [label_gain(1,5)]


gbm.fit(X_train, y_train, group=query_train,
        eval_set=[(X_val, y_val)], eval_group=[query_val],  # values=[label_gain(1,5)],
        eval_metric=["ndcg"], eval_at=[1, 2, 3, 4, 5], early_stopping_rounds=10)


# make predictions
test_pred = gbm.predict(X_test)
# X_test is a NumPy array, so build a DataFrame to attach the predictions
results = pd.DataFrame(X_test, columns=["Frequency", "Comments"])
results["predicted_ranking"] = test_pred
results.sort_values("predicted_ranking", ascending=False)

Can anyone help me??

Thank you in advance !!

Sofia


Most upvoted comments

@sofiavlachou28 Thanks for your interest in LightGBM!

I wrote up a learning-to-rank example tonight to hopefully answer this and other issues you’ve opened regarding LGBMRanker in the Python package (#5297, #5283).


label_gain

As described in https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective:

label_gain can be used to set the gain (weight) of int label and all values in label must be smaller than number of elements in label_gain

And as described in https://lightgbm.readthedocs.io/en/latest/Parameters.html#label_gain

…only used in lambdarank application
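
As a minimal illustration of why the error in this issue mentions 31: by default, LightGBM uses label_gain_[i] = (1 << i) - 1 (see the comment from @shiyu1994 below), which has 31 entries, so every label value must be an integer below 31.

default_gain = [(1 << i) - 1 for i in range(31)]  # LightGBM's default label gains
len(default_gain)  # 31 -> labels must be integers in [0, 30]
# a label of 72 has no matching entry, hence:
# "Label 72 is not less than the number of label mappings (31)"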


group parameter

As described in https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html#lightgbm.LGBMRanker.fit

Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

This parameter is necessary to tell LightGBM which collections of rows in the training data represent documents from the same “query”. If you aren’t literally working with search engine data (where you have a list of results returned by a single search), you might define “query” as, for example, “all movie ratings created by one user”.
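
As a hedged sketch with hypothetical toy data, here is how group relates to the rows (the rows must already be sorted so that each query’s documents are contiguous):

import numpy as np

# 6 documents belonging to 2 queries: 4 rows for query "a", then 2 for query "b"
query_ids = np.array(["a", "a", "a", "a", "b", "b"])
# `group` lists the number of consecutive rows per query, in row order;
# np.unique sorts the ids, which matches row order here because the rows are sorted
_, group = np.unique(query_ids, return_counts=True)  # -> array([4, 2])
assert group.sum() == len(query_ids)  # sum(group) == n_samples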


Sample Code

I created the following example using Python 3.8.12 and lightgbm installed from source from the latest commit on master (https://github.com/microsoft/LightGBM/commit/9489f878b3568e70b441e5df602483e116f24cc6).

The example below uses LightGBM to build a learning-to-rank model to learn how users in the MovieLens-100K dataset rated different movies.

import io
import os
import zipfile
import lightgbm as lgb
import pandas as pd
import requests
from scipy.stats import spearmanr

def load_movielens(local_dir: str) -> tuple:
    data_url = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
    if not os.path.isdir(local_dir):
        print(f"creating directory '{local_dir}' to store movielens dataset")
        os.mkdir(local_dir)
    
    zip_file = zipfile.ZipFile(io.BytesIO(requests.get(data_url).content), "r")

    zip_file.extract(
        member="ml-100k/u.data",
        path=local_dir
    )
    rating_df = pd.read_csv(
        os.path.join(local_dir, "ml-100k", "u.data"),
        sep="\t",
        header=None,
        names=["user_id", "item_id", "rating", "timestamp"]
    )
    zip_file.extract(
        member="ml-100k/u.user",
        path=local_dir
    )
    user_df = pd.read_csv(
        os.path.join(local_dir, "ml-100k", "u.user"),
        sep="|",
        encoding="latin-1",
        header=None,
        names=["user_id", "age", "gender", "occupation", "zip_code"]
    )
    zip_file.extract(
        member="ml-100k/u.item",
        path=local_dir
    )
    item_df = pd.read_csv(
        os.path.join(local_dir, "ml-100k", "u.item"),
        sep="|",
        encoding="latin-1",
        header=None,
        names=[
            "movie_id",
            "movie_title",
            "release_date",
            "video_release_date",
            "imdb_url",
            "genre=unknown",
            "genre=Action",
            "genre=Adventure",
            "genre=Animation",
            "genre=Childrens",
            "genre=Comedy",
            "genre=Crime",
            "genre=Documentary",
            "genre=Drama",
            "genre=Fantasy",
            "genre=Film_Noir",
            "genre=Horror",
            "genre=Musical",
            "genre=Mystery",
            "genre=Romance",
            "genre=Sci_Fi",
            "genre=Thriller",
            "genre=War",
            "genre=Western"
        ]
    )
    out_df = rating_df.merge(
        right=user_df,
        how="left",
        on=["user_id"],
        suffixes=("_rating", "_user")
    )
    out_df = out_df.merge(
        right=item_df,
        how="left",
        left_on=["item_id"],
        right_on=["movie_id"],
        suffixes=(None, "_movie")
    )
    # drop join keys and other unnecessary columns
    out_df.drop(["imdb_url", "item_id", "movie_id", "movie_title", "video_release_date", "zip_code"], axis=1, inplace=True)
    out_df = out_df.sort_values(["user_id"], ignore_index=True)

    # LightGBM expects integer labels starting at 0, but these ratings go from 1 to 5
    rating = out_df.pop("rating").values - 1
    
    # use "user_id" to group queries
    user_id = out_df.pop("user_id")
    group = user_id.value_counts(sort=False).values
    
    return out_df, rating, group

# get movielens data
X, y, g = load_movielens("data")

# collapse 1-hot-encoded genre into 1 feature
genre_columns = [c for c in X.columns if c.startswith("genre")]
X["movie_genre"] = X[genre_columns].head().idxmax(1)
X.drop(genre_columns, axis=1, inplace=True)

# create a "movie age" feature
X["movie_age_when_rated"] = (
    pd.to_datetime(X["timestamp"], unit="s") -
    pd.to_datetime(X["release_date"])
) / pd.Timedelta(days=1)
X.drop(["timestamp", "release_date"], axis=1, inplace=True)

# convert "object" columns to unordered categories
for col in X.columns:
    if pd.api.types.is_object_dtype(X[col]):
        X[col] = pd.Categorical(X[col])

Looking at the shape of these objects may be informative.

The features include some characteristics of the reviewer and some characteristics of the movies.

print(X.head().to_markdown())
|    |   age | gender   | occupation   | movie_genre   |   movie_age_when_rated |
|---:|------:|:---------|:-------------|:--------------|-----------------------:|
|  0 |    24 | M        | technician   | genre=Crime   |               1362.16  |
|  1 |    24 | M        | technician   | genre=Western |               2133.31  |
|  2 |    24 | M        | technician   | genre=Action  |               6841.15  |
|  3 |    24 | M        | technician   | genre=Comedy  |                362.215 |
|  4 |    24 | M        | technician   | genre=Action  |               1362.16  |

The target is integer ratings from 0 to 4 (where 0 is very bad and 4 is very good).

y[:10]
# array([4, 3, 4, 4, 3, 2, 3, 3, 3, 3])

And g, the group array, groups all ratings from one user together as one “query”.

g[:10]
# array([272,  62,  54,  24, 175, 211, 403,  59,  22, 184])

This says “the first 272 rows in X are one query, the next 62 rows are another query, etc.”.

Given data in this format, LGBMRanker can be used to fit a learning-to-rank model.

rnk = lgb.LGBMRanker(
    n_estimators=100,
)
rnk.fit(X=X, y=y, group=g)

To check the in-sample fit, you can use something like Spearman correlation, which checks how well the ordering of the predicted scores matches the actual ratings.

round(spearmanr(y, rnk.predict(X)).correlation, 5)
# 0.21626

In the Lambdarank application, LightGBM doesn’t give equal weight to all positions in the ranking. For example, it will give higher preference to splits that help it choose correctly between the 1st and 2nd most relevant items than to splits that help it choose correctly between the 4th and 5th most relevant items.

This is where the label_gain parameter comes in. That parameter describes how much more importance LightGBM places on the ordering of different items.

For example, in this dataset with 5 possible ratings, something like the following…

label_gain = [1, 2, 4, 8, 16]

says “correctly ordering the first and second most relevant items is twice as important as correctly ordering the second and third most relevant items”.

I encourage you to try with different values of this parameter, like this:

rnk = lgb.LGBMRanker(
    n_estimators=100,
    label_gain=[1, 2, 4, 8, 16]
)
rnk.fit(X=X, y=y, group=g)

I hope these examples help! I am going to close and lock #5297 and #5283. If you have other questions about this topic, please ask here.

If you have questions about other LightGBM topics, please open new issues and provide all the information asked for in the issue template.


cc @shiyu1994 @StrikerRUS @ffineis please correct me if anything I’ve said above is imprecise or incorrect

@sofiavlachou28 Thanks for using LightGBM! Ranking objectives in LightGBM use label_gain_ to store the gain of each label value. By default, label_gain_[i] = (1 << i) - 1, which has 31 entries, so the default label gain only supports label values smaller than 31. It seems that your dataset contains label values of 31 or greater, so please specify a customized label_gain, as @guolinke mentioned.
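
For illustration, a hedged sketch of passing a longer label_gain (the data here is randomly generated and purely hypothetical):

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 73, size=100)  # integer labels 0..72, like in this issue
group = [50, 50]                   # two queries of 50 rows each

# provide one gain value per possible label value: len(label_gain) must exceed max(y)
ranker = lgb.LGBMRanker(label_gain=list(range(int(y.max()) + 1)))
ranker.fit(X, y, group=group)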

Re-opening this since there are unanswered questions, but I personally would need to do some research before providing an answer.

@sathyarr I think you are correct that label_gain=[1, 1, 1, 1, 1] represents equal importance for all items we are ranking.

About your second topic, I think label_gain refers to the number of possible rating values (label mappings), not the number of items being rated, based on the error raised in this issue: “Label x is not less than the number of label mappings (y)”. So if you have n possible ratings in your dataset, you should have at least the same number of values in the label_gain list, irrespective of the number of items being rated. I can confirm it works when the number of items is higher than the number of label_gain values.

I think the misunderstanding comes from this:

says “correctly ordering the first and second most relevant items is twice as important as correctly ordering the second and third most relevant items”.

and it is more correct to say: “correctly labeling items as having the first or second highest score is twice as important as correctly labeling items as having the second or third highest score”. This makes sense, since the model returns a score and not a direct ordering, and multiple items can have the same value (at least in the training set).

I would be very grateful if @jameslamb or someone with a greater understanding of the model than me could confirm or deny this. Thanks!
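
A small illustrative check of the “gain is indexed by label value, not by position” reading (toy values chosen only for this sketch):

# entry i of label_gain is the gain used for label value i
label_gain = [1, 2, 4, 8, 16]
labels = [4, 4, 2, 0]                        # two items share the top grade 4
gains = [label_gain[lbl] for lbl in labels]  # -> [16, 16, 4, 1]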

I had the same problem with label mappings when I tried to use LGBMRanker with Optuna. I got them to work well together by following this example. As suggested, I set the label_gain parameter as:

[i for i in range(max(y_train.max(), y_valid.max()) + 1)]

I hope it can also help you!

Best,

Thanks for providing the dataset, @sofiavlachou28.

Looking at the data, I have some observations and a suggestion.

The label for a learning-to-rank problem is expected to be a “relevance score” explaining how relevant one document is compared to another (see this Stack Overflow answer for a concise explanation).

If you set label_gain based on the maximum value in y_train, as suggested in https://github.com/microsoft/LightGBM/issues/4808#issuecomment-973991963, model training might run without throwing any errors, but (making some assumptions about your data based on column names) I don’t think the resulting model will be what you intended.

It seems you’re using the column Nwords for the label, which I assume is “number of words in the post”. If you want to use LightGBM to predict the number of words in a document based on how popular it was (likes, comments), I recommend treating that as a regression problem and using LGBMRegressor, not LGBMRanker (see the sketch at the end of this comment).

One other suggestion…I noticed the full dataset has only 91 rows, even before holding out some data for validation.

Sample code:
import lightgbm as lgb
import pandas as pd

data_url = "https://github.com/microsoft/LightGBM/files/7569237/Posts.csv"

column_names = ["Frequency", "Comments", "Likes", "Nwords"]

df = pd.read_csv(
    filepath_or_buffer=data_url,
    delimiter=";",
    encoding="utf-8",
    names=column_names,
    delim_whitespace=False,
    header=0
)
df.shape

LightGBM has a few parameters to limit model complexity, whose defaults are set to work well with medium-sized datasets (1000s of observations). If you want LightGBM to learn from just 91 observations, consider setting a very small value (like 2) for the parameter min_data_in_leaf (link).
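
Putting those two suggestions together, a hedged sketch (column roles are assumed from their names; the split and parameter values are illustrative, not a recommendation):

import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

data_url = "https://github.com/microsoft/LightGBM/files/7569237/Posts.csv"
df = pd.read_csv(data_url, delimiter=";", encoding="utf-8",
                 names=["Frequency", "Comments", "Likes", "Nwords"], header=0)

# treat "number of words" as a regression target
X = df[["Frequency", "Comments", "Likes"]]
y = df["Nwords"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# min_child_samples is the scikit-learn-API name for min_data_in_leaf
reg = lgb.LGBMRegressor(min_child_samples=2)
reg.fit(X_train, y_train)
preds = reg.predict(X_test)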

Hi @sofiavlachou28, thanks very much for using LightGBM!

I’d be happy to help you, but we need a little more information.

  1. What version of lightgbm are you using and how did you install it?
  2. Are you able to provide access to the raw data ("Posts.csv"), or to replicate this problem using randomly-created data?
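
(For question 1, a quick way to check the installed version:)

import lightgbm
print(lightgbm.__version__)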