bert: How to freeze layers of BERT?

How can I freeze all layers of BERT and train only the task-specific layers during fine-tuning? In pytorch-pretrained-BERT we can do it by setting requires_grad=False for all BERT parameters, but is there a way to do it in the TensorFlow code? I added the code below to the create_optimizer function in optimization.py:

tvars = tf.trainable_variables()
tvars = [v for v in tvars if 'bert' not in v.name]   ## my code: keep only non-BERT variables so all BERT layers are frozen
grads = tf.gradients(loss, tvars)

Is that correct?
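
For reference, here is a minimal sketch of the PyTorch approach mentioned above (the model name and classifier head are placeholders for illustration, not from the original post):

import torch
import torch.nn as nn
from pytorch_pretrained_bert import BertModel

# Load a pretrained BERT encoder (model name is just an example).
bert = BertModel.from_pretrained('bert-base-uncased')

# Freeze every BERT parameter so the optimizer never updates them.
for param in bert.parameters():
    param.requires_grad = False

# A hypothetical task head on top of BERT; only its parameters stay trainable.
classifier = nn.Linear(768, 2)

# Pass only the parameters that still require gradients to the optimizer.
trainable = [p for p in list(bert.parameters()) + list(classifier.parameters()) if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=2e-5)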

Most upvoted comments

@shimafoolad

I don’t understand your question but check out my fork of BERT.

This is the part that makes sure only the layers added on top of BERT are updated during finetuning.
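
A rough TF1-style sketch of that idea (not the exact code from the fork; it assumes loss, optimizer, and global_step are already defined, as inside create_optimizer):

# Collect all trainable variables, then drop the BERT encoder's variables.
tvars = tf.trainable_variables()
task_vars = [v for v in tvars if not v.name.startswith('bert/')]

# Compute and apply gradients only for the task-specific variables,
# so the BERT weights are never touched during fine-tuning.
grads = tf.gradients(loss, task_vars)
train_op = optimizer.apply_gradients(zip(grads, task_vars), global_step=global_step)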

I’ve also written a script that compares the weights in two checkpoint files and prints the weights that differ. I fine-tuned BERT on CoLA and compared the checkpoint files at step 0 and step 267. As expected, only the weights associated with output_weights and output_bias are different:

[screenshot: checkpoint comparison output showing only output_weights and output_bias changed]

I hope this answers your question.
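
For anyone who wants to reproduce that check, a rough sketch of such a comparison could look like this (not the exact script from the fork; the checkpoint paths are placeholders):

import numpy as np
import tensorflow as tf

# Checkpoint paths are placeholders; point them at two checkpoints from the same run.
ckpt_a = tf.train.load_checkpoint('output/model.ckpt-0')
ckpt_b = tf.train.load_checkpoint('output/model.ckpt-267')

# Compare every variable stored in the first checkpoint against the second
# and print the names of those whose values changed between the two steps.
for name, _ in tf.train.list_variables('output/model.ckpt-0'):
    if ckpt_b.has_tensor(name) and not np.array_equal(ckpt_a.get_tensor(name), ckpt_b.get_tensor(name)):
        print('changed:', name)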

@hkvision Try fine-tuning for more epochs with a higher learning rate. I fine-tuned on the CoLA dataset using the default hyperparameters and here are my results after 5 epochs:

[screenshot: eval results after 5 epochs]

This is what I get after 50 epochs and progressively increasing the learning rate by a few orders of magnitude: [screenshot: eval results after 50 epochs]

@OYE93

Have a look at this line.

At that point tvars contains only the weights outside BERT. To unfreeze part of the encoder as well, you will need to add the parameters from layer 11 onwards back to tvars. You can inspect the checkpoint files to see how these weights are named; see the sketch below.
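
A rough sketch of that, assuming the standard variable naming in BERT's TF checkpoints (bert/encoder/layer_<n>/...) and that loss is already defined; the checkpoint path is just an example:

import tensorflow as tf

# Print how the variables are named in a checkpoint (path is an example).
for name, shape in tf.train.list_variables('uncased_L-12_H-768_A-12/bert_model.ckpt'):
    print(name, shape)

# Keep the task-specific variables and add back BERT's last encoder layer (layer 11),
# so that layer is fine-tuned while the rest of BERT stays frozen.
tvars = tf.trainable_variables()
tvars = [v for v in tvars
         if 'bert' not in v.name
         or 'bert/encoder/layer_11/' in v.name]
grads = tf.gradients(loss, tvars)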

@hsm207 Thanks so much! I’m just doing some experiments on my own to see the impact of freezing BERT. Since I’m running BERT on CPU, freezing BERT will be much faster… I understand that with BERT frozen the number of trainable parameters is really limited, so at the very least we will need to put in more effort to make the results comparable to those without freezing BERT.