transformers: training a new BERT from scratch does not seem to work

I tried to train a BERT model from scratch with “run_lm_finetuning.py” on toy training data (samples/sample.txt) by changing the following:

    # model = BertForPreTraining.from_pretrained(args.bert_model)
    bert_config = BertConfig.from_json_file('bert_config.json')
    model = BertForPreTraining(bert_config)

where the json file comes from the BERT-Base, Multilingual Cased release.

To check the correctness of training, I printed the sequence-relationship scores (used for the next-sentence-prediction task) in “pytorch_pretrained_bert/modeling.py”:

    prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output)
    print(seq_relationship_score)

And the result was (just picking an example from a single batch):

    tensor([[-0.1078, -0.2696],
            [-0.1425, -0.3207],
            [-0.0179, -0.2271],
            [-0.0260, -0.2963],
            [-0.1410, -0.2506],
            [-0.0566, -0.3013],
            [-0.0874, -0.3330],
            [-0.1568, -0.2580],
            [-0.0144, -0.3072],
            [-0.1527, -0.3178],
            [-0.1288, -0.2998],
            [-0.0439, -0.3267],
            [-0.0641, -0.2566],
            [-0.1496, -0.3696],
            [ 0.0286, -0.2495],
            [-0.0922, -0.3002]], device='cuda:0', grad_fn=<AddmmBackward>)

Notice that since the scores in the first column were higher than those in the second column, the model predicted every example in the batch as the same class (all “next sentence” or all “not next sentence”). And this result was the same for all batches. I feel this shouldn’t be the case.
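For what it’s worth, a quick way to confirm this collapse (this check is my own addition, not part of the original script) is to take the argmax over the two columns right after the existing print:

    # Hypothetical diagnostic, placed next to the existing print in modeling.py:
    # count how often each next-sentence class wins within the batch.
    import torch
    predictions = seq_relationship_score.argmax(dim=-1)  # shape: (batch_size,)
    print(predictions)                                    # e.g. all zeros if the model collapsed
    print(torch.bincount(predictions, minlength=2))       # per-class counts for the batch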

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

Hi guys,

see the paper for TPU training details; an estimate of the training time on GPUs is about a week using 64 GPUs.

Btw, there is an article on this topic: http://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/

I was wondering whether anyone has tried tweaking some parameters of the transformer so that it converges much faster (of course, possibly at the expense of accuracy), e.g. (a rough sketch of both ideas follows the list):

  • Initializing the embedding layer with FastText / your embeddings of choice - in our tests this boosted accuracy and convergence with simpler models;
  • Using a more standard 200- or 300-dimensional embedding instead of 768 (also tweaking the hidden size accordingly).
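
To make the idea concrete, here is a minimal sketch of what I mean (the exact numbers and the `fasttext_vectors` array are placeholders, not something I have tested with this repository):

    # Sketch only: a smaller BERT config plus word embeddings seeded from
    # pretrained vectors. `fasttext_vectors` is a hypothetical numpy array of
    # shape (vocab_size, 300), row-aligned with the BERT tokenizer vocabulary.
    import torch
    from pytorch_pretrained_bert import BertConfig, BertForPreTraining

    config = BertConfig(
        vocab_size_or_config_json_file=119547,  # multilingual cased vocab size (check bert_config.json)
        hidden_size=300,                        # instead of 768
        num_hidden_layers=6,
        num_attention_heads=6,                  # 300 / 6 = 50 dims per head
        intermediate_size=1200,
    )
    model = BertForPreTraining(config)

    # Copy the pretrained vectors into the word embedding matrix.
    model.bert.embeddings.word_embeddings.weight.data.copy_(
        torch.from_numpy(fasttext_vectors))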

Personally, for me the allure of the transformer is not really the state-of-the-art accuracy, but having the same architecture applicable to any sort of NLP task (whereas QA tasks or SQuAD-like objectives may otherwise require custom engineering or non-transferable models).

Hi @thomwolf,

I trained the model for an hour, but the loss stays around 0.6-0.8 and never converges. I know it’s computationally expensive to train BERT; that’s why I chose the very small dataset (sample.txt, which only has 36 lines).

The main issue is that I have tried the same dataset with the original TensorFlow version of BERT, and it converges within 5 minutes:

    next_sentence_accuracy = 1.0
    next_sentence_loss = 0.00012585879

That’s why I’m wondering if something is wrong with the model. I have also checked the output of each forward step and found that the rows of the encoder output are very similar to each other, i.e. the token representations in encoded_layers are nearly identical:

    encoded_layers = self.encoder(embedding_output,
                                  extended_attention_mask,
                                  output_all_encoded_layers=output_all_encoded_layers)
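
For reference, this is roughly how I would quantify that similarity (my own diagnostic sketch, not code from modeling.py; it assumes encoded_layers is still the list of per-layer tensors returned by the encoder):

    # Hypothetical check: how similar are the token representations in the last
    # encoder layer for the first example in the batch?
    final_layer = encoded_layers[-1][0]                        # (seq_len, hidden_size)
    normed = final_layer / final_layer.norm(dim=-1, keepdim=True)
    print((normed @ normed.t()).mean())                        # mean pairwise cosine similarity between rows
    print(final_layer.std(dim=0).mean())                       # average per-dimension spread across tokens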