sentencepiece: terminate called after throwing an instance of 'std::bad_alloc'

I’m training a SentencePiece model and get a std::bad_alloc error when I increase the training size from 5M to 10M sentences (it works fine for 5M sentences). Here’s how I’m calling the trainer:

spm_train --input=input.txt --vocab_size=32000 --character_coverage=1.0 \
    --model_type=unigram --input_sentence_size=10000000 --num_threads=32

Here’s the specific error:

trainer_interface.cc(317) LOG(INFO) Sampled 10000000 sentences from 283087079 sentences.
trainer_interface.cc(321) LOG(INFO) Skipped 209436 too long sentences.
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
trainer_interface.cc(384) LOG(INFO) all chars count=3460742236
trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(402) LOG(INFO) Alphabet size=25
trainer_interface.cc(403) LOG(INFO) Final character coverage=1
trainer_interface.cc(435) LOG(INFO) Done! preprocessed 10000000 sentences.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

I’ve tried compiling SentencePiece both with and without gperftools and get the same error. It was compiled with gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), in case that matters. (Edit: I also tried a more recent gcc 8.2.0 with the same result.) I doubt it’s a RAM limitation: I’m running this on a fairly beefy compute node with 768 GB of memory, and watching memory utilization while the program runs (even at 5M input sentences) I never come close to maxing it out. Any thoughts on why I might be getting this error?

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 12
  • Comments: 15

Most upvoted comments

Thank you for the report. Will fix it soon.

Did you try using the new --max_sentence_length flag? I haven’t tried it yet myself. My old “fix” is still out there, but there have been many commits since that version, so it’s probably best to start with the current released version and the new flag.
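For anyone who wants to try it, the call would look roughly like this; the 65536-byte limit is just an illustrative value, not one taken from this thread, and --model_prefix is simply whatever prefix you want for the output files:

spm_train --input=input.txt --model_prefix=m --vocab_size=32000 \
    --max_sentence_length=65536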

On Thu, Jun 25, 2020 at 1:32 AM Aditya Gupta notifications@github.com wrote:

Hi @taku910, @tuglat, I am running into the same issue, and my training data is also not very large. I’ve attached the logs below; any help is highly appreciated!

trainer_interface.cc(267) LOG(INFO) Loading corpus: tmp/all_text.out
trainer_interface.cc(287) LOG(WARNING) Found too long line (27755 > 20480).
trainer_interface.cc(289) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(290) LOG(WARNING) The maximum length can be changed with --max_sentence_length= flag.
trainer_interface.cc(139) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 2000000 lines
trainer_interface.cc(114) LOG(WARNING) Too many sentences are loaded! (2969902), which may slow down training.
trainer_interface.cc(116) LOG(WARNING) Consider using --input_sentence_size= and --shuffle_input_sentence=true.
trainer_interface.cc(119) LOG(WARNING) They allow to randomly sample sentences from the entire corpus.
trainer_interface.cc(315) LOG(INFO) Loaded all 2969902 sentences
trainer_interface.cc(321) LOG(INFO) Skipped 98 too long sentences.
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxunk
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxpad
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxbos
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxeos
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxfld
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxmaj
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxup
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxrep
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: ▁xxwrep
trainer_interface.cc(330) LOG(INFO) Adding meta_piece:
trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
[I 10:42:28.418 NotebookApp] Saving file at /VoC/Sentiment_Analysis/ULMFiT/QRNN_SentencePiece/Fwd_LM_Amzn.ipynb
trainer_interface.cc(384) LOG(INFO) all chars count=2449594649
trainer_interface.cc(392) LOG(INFO) Done: 99.9991% characters are covered.
trainer_interface.cc(402) LOG(INFO) Alphabet size=62
trainer_interface.cc(403) LOG(INFO) Final character coverage=0.999991
trainer_interface.cc(435) LOG(INFO) Done! preprocessed 2969902 sentences.
[I 10:46:43.313 NotebookApp] Saving file at /VoC/Sentiment_Analysis/ULMFiT/QRNN_SentencePiece/Fwd_LM_Amzn.ipynb
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc


The latest release added a flag for training a model from large corpora (> 10M sentences): --train_extremely_large_corpus

Note that this flag will increase the memory footprint drastically.

https://github.com/google/sentencepiece/releases/tag/v0.1.9
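As a rough sketch, adding it to the command from the top of this issue would look like the following (same input file and sizes as quoted above; --model_prefix is added only so the example is complete):

spm_train --input=input.txt --model_prefix=m --vocab_size=32000 --character_coverage=1.0 \
    --model_type=unigram --input_sentence_size=10000000 --num_threads=32 \
    --train_extremely_large_corpus=true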

I will put that together for you.

Look at tuglat/sentencepiece, branch fixOverflow. It is a brute-force approach and probably breaks integration compatibility with other tools such as Marian, but it is still fine as a standalone tool.
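If you want to try that branch, a build along the lines of the standard SentencePiece instructions should work; the repository URL is inferred from the fork name above, and the job count and system-wide install are just examples:

git clone https://github.com/tuglat/sentencepiece.git
cd sentencepiece
git checkout fixOverflow
mkdir build && cd build
cmake ..
make -j 8
sudo make install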