sentencepiece: terminate called after throwing an instance of 'std::bad_alloc'
I’m training a sentencepiece model and getting a std::bad_alloc error when I increase the training size from 5M to 10M sentences (it works fine at 5M). Here’s how I’m calling spm_train:
spm_train --input=input.txt --vocab_size=32000 --character_coverage=1.0
--model_type=unigram --input_sentence_size=10000000 --num_threads=32
Here’s the specific error:
trainer_interface.cc(317) LOG(INFO) Sampled 10000000 sentences from 283087079 sentences.
trainer_interface.cc(321) LOG(INFO) Skipped 209436 too long sentences.
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
trainer_interface.cc(384) LOG(INFO) all chars count=3460742236
trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(402) LOG(INFO) Alphabet size=25
trainer_interface.cc(403) LOG(INFO) Final character coverage=1
trainer_interface.cc(435) LOG(INFO) Done! preprocessed 10000000 sentences.
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
I’ve tried compiling SentencePiece with and without gperftools and get the same error message. It was compiled with gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), in case that matters. (Edit: I also tried a more recent gcc 8.2.0 with the same results.) I doubt it’s a RAM limitation; I’m running this on a pretty beefy compute node with 768 GB of memory, and watching memory utilization while the program runs (even at 5M input sentences) I never come close to maxing it out. Any thoughts on why I might be getting this error?
About this issue
- State: closed
- Created 5 years ago
- Reactions: 12
- Comments: 15
Thank you for the report. Will fix it soon.
Did you try using the new “--max_sentence_length” flag? I haven’t tried it yet myself. My old “fix” is still out there, but there have been many commits since that version, so it’s probably best to start with the current released version and the new flag.
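For reference, the flag would just be appended to the original command, something like the following (the 8192 value is only illustrative, not a recommendation):
spm_train --input=input.txt --vocab_size=32000 --character_coverage=1.0
--model_type=unigram --input_sentence_size=10000000 --num_threads=32 --max_sentence_length=8192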
On Thu, Jun 25, 2020 at 1:32 AM Aditya Gupta notifications@github.com wrote:
The latest release added a flag for training a model from large corpora (> 10M sentences):
--train_extremely_large_corpus
Note that this flag will increase the memory footprint drastically.
https://github.com/google/sentencepiece/releases/tag/v0.1.9
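Applied to the command from the original report, that would look something like this (same parameter values as above, with only the extra flag added):
spm_train --input=input.txt --vocab_size=32000 --character_coverage=1.0
--model_type=unigram --input_sentence_size=10000000 --num_threads=32 --train_extremely_large_corpus=true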
I will put that together for you.
Look at tuglat/sentencepiece, branch fixOverflow. This is a brute-force approach and probably breaks integration compatibility with other tools, like marian. It’s still good as a standalone tool, however.
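If you want to try that branch, something along these lines should work, assuming the branch still exists and builds with the standard SentencePiece CMake steps (the clone URL is inferred from the repo name):
git clone https://github.com/tuglat/sentencepiece.git
cd sentencepiece
git checkout fixOverflow
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install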