simpletransformers: Problems when classifying after finetuning BERT (Multi-Label)
I am following this write-up on multi-label classification with Simple Transformers: https://towardsdatascience.com/multi-label-classification-using-bert-roberta-xlnet-xlm-and-distilbert-with-simple-transformers-b3e0cda12ce5
I am having some difficulties. I loaded a Dutch base BERT model (from here: https://github.com/wietsedv/bertje) and then trained a multi-label model with 50 labels:
import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.classification import MultiLabelClassificationModel

df = pd.read_csv("all_data_withid.csv", encoding="utf8", delimiter=";")
df['labels'] = list(zip(df.label1.tolist(), df.label2.tolist(), ...))  # truncated for brevity
train_df, eval_df = train_test_split(df, test_size=0.3, random_state=123456)
model = MultiLabelClassificationModel('bert', 'bert-base-dutch-cased/bertje-base', num_labels=50, args={'train_batch_size': 2, 'gradient_accumulation_steps': 16, 'learning_rate': 3e-5, 'num_train_epochs': 1, 'max_seq_length': 512, 'fp16': False})
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
Now the end result is that I get an LRAP score of roughly 0.71. However, now I am a bit puzzled on how to use this model to classify a single new instance. I closed Python, opened it again and loaded my trained model from disk:
model = MultiLabelClassificationModel('bert', 'outputs', num_labels=50, args={'train_batch_size': 2, 'gradient_accumulation_steps': 16, 'learning_rate': 3e-5, 'num_train_epochs': 1, 'max_seq_length': 512, 'fp16': False})
I then tried model.predict(["dit is een test"]) and model.predict(["en nog een compleet andere test"])
As it turns out, the resulting outputs and predictions for these two distinct sentences are exactly the same on every value, and the predictions are all 0s for every class. I also ran the evaluation (result, model_outputs, wrong_predictions = model.eval_model(eval_df)) three times on different splits of my dataset, but in all scenarios the resulting LRAP is the same ~0.71.
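All-zero predictions usually mean every per-label probability fell below the decision threshold, so a head that has collapsed to uniformly negative logits marks every label 0 for every input. A minimal sketch of that per-label decision rule (plain Python; the 0.5 threshold is assumed, matching the common default for multi-label classification):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def to_multilabel(logits, threshold=0.5):
    # one independent sigmoid per label; a label fires only if its
    # probability clears the threshold
    return [1 if sigmoid(z) > threshold else 0 for z in logits]

# a collapsed head that emits uniformly negative logits marks every
# label 0, regardless of the input sentence
print(to_multilabel([-2.3, -1.7, -4.0]))  # [0, 0, 0]
```

This is why inspecting the raw model outputs (not just the thresholded predictions) is the quickest way to tell a broken model from an overly strict threshold.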
What am I doing wrong here?
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 67 (23 by maintainers)
Lowering the learning rate and/or the number of training epochs seems to be the best solution to prevent the model from breaking completely and predicting the same class.
Same problem here, accuracy of 98% but in prediction only getting 0 for all labels. Tried Albert, Roberta, Bert, distilbert
Edit: Problem solved after completely reinstalling and rebooting
That is the general practice. Weight decay is not applied to normalization layers and bias weights. My understanding is that it is unnecessary as those don’t usually overfit.
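That convention is implemented by splitting the parameters into a decay group and a no-decay group before building the optimizer. A sketch of that grouping (the `"bias"` / `"LayerNorm.weight"` name filters follow the usual Hugging Face pattern; the function name is illustrative):

```python
def split_by_weight_decay(named_params, weight_decay=0.01):
    # parameters whose names match these substrings get no weight decay
    no_decay = ("bias", "LayerNorm.weight")
    decay_group = [p for n, p in named_params
                   if not any(nd in n for nd in no_decay)]
    no_decay_group = [p for n, p in named_params
                      if any(nd in n for nd in no_decay)]
    return [
        {"params": decay_group, "weight_decay": weight_decay},
        {"params": no_decay_group, "weight_decay": 0.0},  # norms & biases: no decay
    ]

# toy stand-ins for model.named_parameters()
named = [("encoder.layer.0.weight", "w"),
         ("encoder.layer.0.bias", "b"),
         ("encoder.LayerNorm.weight", "ln")]
groups = split_by_weight_decay(named)
```

The resulting list of group dicts is what a PyTorch-style optimizer accepts as per-parameter options.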
Greetings, I think I solved it - it is the learning rate.
First of all, as @ThilinaRajapakse and @venkatasg pointed out, the concern that the inputs to the classifier are inappropriate is irrelevant. I got different results when I changed this, probably by chance.
The learning rate applied by default is 4e-5, which is fine for fine-tuning the transformer itself but not that good for the freshly initialized classification layer. What I did to get good predictions was to set the learning rate of the classification layer to 1e-3 while keeping 4e-5 for the transformer. So far I have only tried this with BERT models.
So I modified the train function of ClassificationModel in simpletransformers.classification.classification_model from this:
to this (you also need to add the new learning_rate_classifier argument to global_args):
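The actual diff isn't shown above, but the gist of the change can be sketched as building two optimizer parameter groups, one per learning rate. The function name and the `"classifier"` name filter below are illustrative (Hugging Face BERT heads are conventionally named `classifier`), not the exact simpletransformers code:

```python
def build_param_groups(named_params, base_lr=4e-5, classifier_lr=1e-3):
    # split parameters by name: the classification head gets its own,
    # larger lr while the pretrained transformer keeps the usual
    # fine-tuning lr
    encoder = [p for name, p in named_params if "classifier" not in name]
    head = [p for name, p in named_params if "classifier" in name]
    return [
        {"params": encoder, "lr": base_lr},
        {"params": head, "lr": classifier_lr},
    ]

# toy stand-ins for model.named_parameters()
named = [("bert.encoder.layer.0.attention.self.query.weight", "w0"),
         ("classifier.weight", "cw"),
         ("classifier.bias", "cb")]
groups = build_param_groups(named)
# groups would then be passed to the optimizer, e.g. AdamW(groups)
```

Passing a list of group dicts like this is the standard PyTorch mechanism for per-group learning rates.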
You don’t need to handle this manually anymore. Check docs here.
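In current simpletransformers versions this is exposed through model args rather than by patching the train function. Assuming the `custom_parameter_groups` option described in the docs (verify the exact key against your installed version), a per-head learning rate looks roughly like:

```python
# hedged sketch: custom_parameter_groups is the documented simpletransformers
# mechanism for per-group learning rates; check your version's docs for the key
model_args = {
    "learning_rate": 4e-5,  # default lr for everything not in a custom group
    "custom_parameter_groups": [
        {
            "params": ["classifier.weight", "classifier.bias"],
            "lr": 1e-3,  # larger lr for the classification head only
        }
    ],
}
# model = MultiLabelClassificationModel("bert", "outputs",
#                                       num_labels=50, args=model_args)
```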
@Lysimachos @ThilinaRajapakse can you please tell me where to add this to simpletransformers code ? I’m doing multi-label classification and I think I’m facing a similar issue, but I don’t know where to add this code to make it work. Thanks!
I recommend using a recent model from ACL 2020 instead of a plain classifier: https://github.com/dmis-lab/BioSyn
Got it!
Seems to change the behavior indeed. Note for others: I did have to remove the cache dir.