transformers: What to do about this warning message: "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification"
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
returns this warning message:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
This just started popping up with v3, so I'm not sure what the recommended action to take here is. Please advise if you can. Basically, any of my code using the `AutoModelFor<X>` classes is throwing this warning now.
Thanks.
@ohmeow you're loading the `bert-base-cased` checkpoint (which is a checkpoint that was trained using a similar architecture to `BertForPreTraining`) in a `BertForSequenceClassification` model.

This means that:
- the layers that `BertForPreTraining` has, but `BertForSequenceClassification` does not have, will be discarded
- the layers that `BertForSequenceClassification` has, but `BertForPreTraining` does not have, will be randomly initialized.

This is expected, and tells you that you won't have good performance with your `BertForSequenceClassification` model before you fine-tune it 🙂.

@fliptrail this warning means that during your training, you're not using the `pooler` in order to compute the loss. I don't know how you're fine-tuning your model, but if you're not using the pooler layer then there's no need to worry about that warning.

You can manage the warnings with the `logging` utility introduced in version 3.1.0:

@LysandreJik Thanks for the rapid response, I set it with `set_verbosity_error()`.
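For reference, a minimal sketch of that logging approach (assuming transformers >= 3.1.0; not verbatim from the thread):

```python
import transformers

# Silence everything below ERROR severity, including the
# "Some weights of the model checkpoint ... were not used" messages.
transformers.logging.set_verbosity_error()

model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
```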
@LysandreJik Thank you for your response. I am using the code:
I am only using the `TFBertModel.from_pretrained("bert-base-uncased")` pre-built class. I am not initializing it from any other class. Still, I am encountering the warning. From what I can understand, this should only appear when initializing a given pre-trained model inside another class. Am I fine-tuning correctly? Are the BERT layer weights also getting updated?

Warning while loading the model:
While attempting to train:
This warning only started to appear yesterday, in all of my code and in other sample code as well.
Thanks @LysandreJik
Makes sense.
Now, how do we know what checkpoints are available that were trained on `BertForSequenceClassification`?

For those who want to suppress the warning on the latest transformers version, try this; hope it helps 😄
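The snippet itself is not shown above; one option is the environment-variable route (a hedged sketch, assuming transformers honors the `TRANSFORMERS_VERBOSITY` environment variable):

```python
import os

# Must be set before transformers is imported for the first time.
os.environ["TRANSFORMERS_VERBOSITY"] = "error"

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
```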
@s4sarath I’m not sure I understand your question.
@veronica320, the pooler layer is not used when doing sequence classification, so there’s nothing to be worried about.
The pooler is the second output of the `RobertaModel`: https://github.com/huggingface/transformers/blob/v3.4.0/src/transformers/modeling_roberta.py#L691

But only the first output is used in the sequence classification model: https://github.com/huggingface/transformers/blob/v3.4.0/src/transformers/modeling_roberta.py#L1002
@LysandreJik I'm having a slightly different issue here: I'm loading a sequence classification checkpoint in an `AutoModelForSequenceClassification` model, but I still get the warning. Here's my code:

Output:
I believe it’s NOT expected because I’m indeed initializing from a model that I expect to be exactly identical.
I’m only starting to get this warning after upgrading to transformers v3 as well. I’m using 3.3.1 currently. Could you please help? Thanks!
Does anyone know how to suppress this warning? I am aware that the model needs fine-tuning, and I am fine-tuning it, so it becomes annoying to see this over and over again.
You’re right, this has always been the behavior of the models. It wasn’t clear enough before, so we’ve clarified it with this warning.
All of the `BertForXXX` models consist of a BERT model followed by some head which is task-specific. For sequence classification tasks, the head is just a linear layer which maps the BERT transformer hidden state vector to a vector of length `num_labels`, where `num_labels` is the number of classes for your classification task (for example, positive/negative sentiment analysis has 2 labels). If you're familiar with logits, this final vector contains the logits.

In the `transformers` source code, you can see this linear layer (assigned to `self.classifier`) initialized in the constructor for `BertForSequenceClassification`:
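The linked constructor is not reproduced above; a condensed sketch of it (abridged from the v3.x source, forward pass omitted) looks roughly like this:

```python
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)                        # weights come from the checkpoint
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)  # not in the checkpoint
        self.init_weights()
```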
Since `self.classifier` is not part of the pre-trained BERT model, its parameters must be initialized randomly (done automatically by the `nn.Linear` constructor).

@s4sarath Anytime you use code like `model = BertForSequenceClassification.from_pretrained("bert-base-cased")`, the `self.classifier` linear layer will have to be initialized randomly.

@TingNLP You are getting different predictions each time because each time you instantiate the model using `.from_pretrained()`, the `self.classifier` parameters will be different.
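To make that concrete, here is a quick check one could run (a sketch; it assumes no random seed has been fixed):

```python
import torch
from transformers import BertForSequenceClassification

m1 = BertForSequenceClassification.from_pretrained("bert-base-cased")
m2 = BertForSequenceClassification.from_pretrained("bert-base-cased")

# The encoder weights come from the checkpoint, so they match exactly...
print(torch.equal(m1.bert.embeddings.word_embeddings.weight,
                  m2.bert.embeddings.word_embeddings.weight))  # True

# ...but the classification head is freshly (randomly) initialized on each load.
print(torch.equal(m1.classifier.weight, m2.classifier.weight))  # False
```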
Thank you for your explanation.
Actually these four variables shouldn’t be initialized randomly, as they’re part of BERT. The official BERT checkpoints contain two heads: the MLM head and the NSP head.
You can see it here:
Among the logging, you should find this:
This tells you two things:
- `['nsp___cls']`, corresponding to the CLS head, was not used. Since we're using a `***ForMaskedLM`, it makes sense not to use the CLS head.

If you're getting those variables randomly initialized:
then it means you’re using a checkpoint that does not contain these variables. These are the MLM layers, so you’re probably loading a checkpoint that was saved using an architecture that does not contain these layers. This can happen if you do the following:
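The concrete example from the original comment isn't reproduced above; one hypothetical way to end up in that situation is to save a checkpoint from an architecture that lacks the MLM head and then load it into one that expects it:

```python
from transformers import BertForSequenceClassification, BertForMaskedLM

# Save a checkpoint from a model that has no MLM head...
clf = BertForSequenceClassification.from_pretrained("bert-base-uncased")
clf.save_pretrained("./bert-cls-checkpoint")  # hypothetical local path

# ...then load that checkpoint into an architecture that expects the MLM head.
# The MLM layers are missing from the saved weights, so they get randomly
# initialized and the warning above is printed.
mlm = BertForMaskedLM.from_pretrained("./bert-cls-checkpoint")
```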
I hope this answers your question!
I see this same warning when initializing `BertForMaskedLM`, pasted in below for good measure. As other posters have mentioned, this warning began appearing only after upgrading to v3.0.0.

Note that my module imports/initializations essentially duplicate the snippet demonstrating cloze task usage at https://huggingface.co/bert-large-uncased-whole-word-masking?text=Paris+is+the+[MASK]+of+France.
Am I correct in assuming that nothing has changed in the behavior of the relevant model, but that perhaps this warning should have been printed all along?
I've been suppressing the warning with this helper:
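The helper is not shown above; one possible shape for it (a sketch, assuming the message is emitted by the `transformers.modeling_utils` logger, as it is for the PyTorch `from_pretrained()` path):

```python
import logging

def suppress_checkpoint_warnings():
    """Hypothetical helper: hide the 'Some weights ... were not used' messages."""
    logging.getLogger("transformers.modeling_utils").setLevel(logging.ERROR)

suppress_checkpoint_warnings()
```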
It does not need to be trained again to be used for a task that it was already trained on: e.g., masked language modeling over a very large, general corpus of books and web text in the case of BERT. However, to perform more specific tasks like classification and question answering, such a model must be re-trained, which is called fine-tuning. Since many popular tasks fall into this latter category, it is assumed that most developers will be fine-tuning the models, and hence the Hugging Face developers included this warning message to ensure developers are aware when the model does not appear to have been fine-tuned.
See Advantages of Fine-Tuning at this tutorial: https://mccormickml.com/2019/07/22/BERT-fine-tuning/#12-installing-the-hugging-face-library
Or check out this page from the documentation: https://huggingface.co/transformers/training.html
@ohmeow that really depends on what you want to do! Sequence classification is a large subject, with many different tasks. Here’s a list of all available checkpoints fine-tuned on sequence classification (not all are for BERT, though!)
Please be aware that if you have a specific task in mind, you should fine-tune your model to that task.
I am also encountering the same warning.
When loading the model:
When attempting to fine tune it:
Is the model fine-tuning correctly? Are the pre-trained model weights also getting updated (fine-tuned), or are only the layers outside (above) the pre-trained model changing their weights during training?
I will explain, bro. Assume classification: the last classification layer is initialized randomly right now. That's okay, because you haven't trained it yet.

But once you train the model and save the checkpoint, at inference time you load that checkpoint, so the predictions remain consistent.
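In other words, something along these lines (a sketch; the directory name is just a placeholder):

```python
from transformers import BertForSequenceClassification

# `model` stands in for a model you have already fine-tuned.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
# ... fine-tuning happens here ...

# Save the full model: encoder weights plus the (now trained) classifier head.
model.save_pretrained("./my-finetuned-bert")  # hypothetical output directory

# At inference time, load that checkpoint instead of "bert-base-uncased".
# The classifier weights come from the saved files, so nothing is randomly
# initialized and predictions are consistent across runs.
model = BertForSequenceClassification.from_pretrained("./my-finetuned-bert")
```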
@LysandreJik Hey, what I am not able to understand is that I was using this code for more than 2 weeks and no warning came up until yesterday. I haven't changed anything, but this warning suddenly showing up is confusing. I am not getting the same output dimension as before and am not able to complete my project.
@fliptrail in your code you have the following:
which means you’re only getting the first output of the model, and using that to compute the loss. The first output of the model is the hidden states:
https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_tf_bert.py#L716-L738
You're ignoring the second value, which is the pooler output. The warnings are normal in your case.
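For completeness, a small sketch of the two return values being described (tuple-style outputs as in v3; integer indexing also works on newer output objects):

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The warnings are normal in this case.", return_tensors="tf")
outputs = model(inputs)

sequence_output = outputs[0]  # per-token hidden states, shape (batch, seq_len, hidden_size)
pooler_output = outputs[1]    # pooled [CLS] representation, shape (batch, hidden_size)
```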