TensorFlowTTS: MB-MelGAN training runs out of memory when starting evaluation

I started fine-tuning the multiband_melgan.v1_24k universal vocoder with the following command:

CUDA_VISIBLE_DEVICES=0 python examples/multiband_melgan/train_multiband_melgan.py \
  --train-dir ./dump_LibriTTSFormatted/train/ \
  --dev-dir ./dump_LibriTTSFormatted/valid/ \
  --outdir ./outdir/MBMELGAN/MBMelgan-Tune-Experiment1 \
  --config ./models/multiband_melgan.v1_24k.yaml \
  --use-norm 1 \
  --pretrained ./models/libritts_24k.h5

After 5000 steps it starts evaluation; at that point it runs out of memory and stops.

2020-11-05 18:51:28,699 (base_trainer:138) INFO: (Steps: 4864) Finished 19 epoch training (256 steps per epoch).
[train]:   0%|▏ | 5000/4000000 [12:05<161:16:34,  6.88it/s]
2020-11-05 18:51:48,400 (base_trainer:566) INFO: (Step: 5000) train_adversarial_loss = 0.0000.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_subband_spectral_convergence_loss = 0.8443.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_subband_log_magnitude_loss = 0.8513.
2020-11-05 18:51:48,401 (base_trainer:566) INFO: (Step: 5000) train_fullband_spectral_convergence_loss = 0.8818.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_fullband_log_magnitude_loss = 0.9599.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_gen_loss = 1.7687.
2020-11-05 18:51:48,402 (base_trainer:566) INFO: (Step: 5000) train_real_loss = 0.0000.
2020-11-05 18:51:48,403 (base_trainer:566) INFO: (Step: 5000) train_fake_loss = 0.0000.
2020-11-05 18:51:48,403 (base_trainer:566) INFO: (Step: 5000) train_dis_loss = 0.0000.
2020-11-05 18:51:48,411 (base_trainer:418) INFO: (Steps: 5000) Start evaluation.
2020-11-05 18:51:52.343952: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 449.88MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-11-05 18:51:52.381361: E tensorflow/stream_executor/cuda/cuda_fft.cc:249] failed to allocate work area.
2020-11-05 18:51:52.381371: E tensorflow/stream_executor/cuda/cuda_fft.cc:426] Initialize Params: rank: 1 elem_count: 683 input_embed: 683 input_stride: 1 input_distance: 683 output_embed: 342 output_stride: 1 output_distance: 342 batch_count: 86272
2020-11-05 18:51:52.381377: F tensorflow/stream_executor/cuda/cuda_fft.cc:435] failed to initialize batched cufft plan with customized allocator:
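
From the allocator warning it looks like TensorFlow has already claimed almost all of the GPU memory by the time evaluation starts, so cuFFT cannot allocate the work area it needs for the STFT losses on the longer validation batches. One thing I could try is enabling memory growth near the top of train_multiband_melgan.py, before any GPU op runs. This is only a minimal sketch of that idea, not something the script already does as far as I can tell:

import tensorflow as tf

# Allocate GPU memory on demand instead of pre-allocating nearly the whole
# card, so cuFFT can still grab its work area when evaluation runs.
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

Setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true before launching should have the same effect, and shrinking the validation batch/segment size in multiband_melgan.v1_24k.yaml (if the config exposes it) might also reduce the evaluation memory peak.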

Any ideas?

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (12 by maintainers)

Most upvoted comments

I tested this, and training now passes the evaluation stage without running out of memory, so the fix worked 😃 Thanks!