audiolm-pytorch: Accelerate failing on multi-gpu rng synchronization
I can run semantic transformer training right now, but not coarse transformer training. Here's what the error looks like:
File "/path/to/trainer.py", line 999, in train_step
data_kwargs = dict(zip(self.ds_fields, next(self.dl_iter)))
File "/path/to/trainer.py", line 78, in cycle
for data in dl:
File "/path/to/venv/site-packages/accelerate/data_loader.py", line 367, in iter
synchronize_rng_states(self.rng_types, self.synchronized_generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 100, in synchronize_rng_states
synchronize_rng_state(RNGType(rng_type), generator=generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 95, in synchronize_rng_state
generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state
This is in the trainer.py file. I don't think the dataloaders are constructed any differently, so I'm confused about whether this is expected (it also wasn't clear to me what "generator" means here versus an rng type like "cuda"). Do you have any ideas why this might fail only for coarse but not semantic?
I found this issue with the same error message, but unfortunately it was never resolved, and I didn't find any other similar issues.
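For future readers puzzling over the same traceback: as far as I can tell, the "generator" rng type refers to the CPU torch.Generator that accelerate attaches to each prepared DataLoader and re-synchronizes across processes each epoch, while types like "cuda" refer to the global RNG states. Below is a minimal sketch, with a toy dataset, of the configuration that keeps that shuffle generator on the CPU; it is not the actual trainer code from this repo.

```python
# Sketch: keep the DataLoader's shuffle generator on the CPU. mt19937 is the
# CPU generator's algorithm, so a CPU state is what set_state() expects when
# accelerate re-synchronizes the generator across processes.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

dataset = TensorDataset(torch.randn(256, 128))  # toy stand-in dataset

cpu_generator = torch.Generator(device="cpu").manual_seed(0)
loader = DataLoader(dataset, batch_size=16, shuffle=True, generator=cpu_generator)

loader = accelerator.prepare(loader)  # accelerate wraps the loader and syncs its rng state

for (batch,) in loader:
    pass  # training step would go here
```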
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 15 (15 by maintainers)
Commits related to this issue
- eliminate parallel training see here: https://github.com/lucidrains/audiolm-pytorch/issues/209#issuecomment-1640777646 — committed to LWprogramming/audiolm-pytorch-training by LWprogramming a year ago
haha yea, we are still in the mainframe days of deep learning. A century from now, maybe it won’t even matter
Ahh ok! I’ll have to rewrite some of my code haha
(for anyone looking at this in the future: I talked to a friend, and they pointed out that training multiple models in parallel either requires moving parameters on and off the GPU a lot more, or, if the models are small enough to all fit in memory at once, the batch size necessarily gets smaller. I still don't know exactly what caused things to break, but it doesn't matter so much now.)
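To make that tradeoff concrete, here is a rough illustration with stand-in models, not the actual audiolm-pytorch trainers: either each model is shuttled on and off the GPU around its training phase, or everything stays resident and the batch size shrinks.

```python
# Option A: move each model onto the GPU for its phase, then back off to free
# memory. Option B (commented below) keeps every model resident, at the cost
# of a smaller batch size since they share one memory budget.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# stand-ins for the real semantic / coarse transformers
semantic_model = nn.Linear(512, 512)
coarse_model = nn.Linear(512, 512)

def train_for(model, steps):
    model.to(device)                                  # option A: bring this model onto the GPU...
    optimizer = torch.optim.Adam(model.parameters())
    for _ in range(steps):
        x = torch.randn(8, 512, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.to("cpu")                                   # ...and move it back off afterwards
    # option B: keep every model on the GPU the whole time and shrink the batch size.

train_for(semantic_model, steps=10)
train_for(coarse_model, steps=10)
```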
Thanks so much!
i can add the error message later today! this is a common gotcha, which i handled before over at imagen-pytorch (which is also multiple networks)
@LWprogramming this is just how neural network training is generally done today, if you have multiple big networks to train
ohh! yeah that’s the issue then
you can only train one network per training script
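For anyone landing here later, the one-network-per-script pattern looks roughly like the sketch below; the model, dataset, and file name are stand-ins, not the real audiolm-pytorch trainer classes.

```python
# train_semantic.py (sketch): one network, one Accelerator, one launch.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

def main():
    accelerator = Accelerator()
    model = nn.Linear(128, 128)  # stand-in for the semantic transformer + trainer
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    loader = DataLoader(TensorDataset(torch.randn(512, 128)), batch_size=32, shuffle=True)

    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for (x,) in loader:
        loss = model(x).pow(2).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    accelerator.wait_for_everyone()
    # train_coarse.py and train_fine.py would mirror this for the other stages.

if __name__ == "__main__":
    main()
```

Each stage then gets its own launch, e.g. accelerate launch train_semantic.py, followed separately by accelerate launch train_coarse.py.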
Yeah, the setup is something like this (given some configurable integer `save_every`): train semantic for `save_every` steps, then train coarse for `save_every` steps, then fine. Then try sampling, then do another `save_every` steps per trainer, and repeat. This way we can gradually see what the samples look like as the transformers gradually train.
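A stripped-down sketch of that round-robin schedule (the Trainer class here is a stand-in, not the real semantic/coarse/fine trainers):

```python
# Sketch of the schedule described above: each trainer advances save_every
# steps, then we sample, and repeat.
class Trainer:
    def __init__(self, name):
        self.name = name
        self.steps_done = 0

    def train(self, steps):
        self.steps_done += steps
        print(f"{self.name}: trained to {self.steps_done} steps")

def sample_everything(trainers):
    print("sampling with", [t.name for t in trainers])

save_every = 1000
trainers = [Trainer("semantic"), Trainer("coarse"), Trainer("fine")]

for round_idx in range(3):        # repeat until the total step budget is reached
    for trainer in trainers:      # semantic -> coarse -> fine
        trainer.train(save_every)
    sample_everything(trainers)   # inspect samples as the models improve
```

This is the several-trainers-in-one-process layout that the discussion above concludes doesn't play well with multi-GPU training via accelerate.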
i don't really know, but probably good to rule out an external library as the issue
will get back to this either end of this week or next Monday. going all out on audio again soon
Yeah, using Encodec. Do you suspect that the codec might be the issue somehow?
I also notice (after adding some more prints) that we see some weird behavior: the debug output includes lines like `on device {device}: accelerator has...` and `device {device} arrived at 2`, but the run gets stuck at `wait_for_everyone()` shortly after point 2, so it never arrives at 3. That seems like `wait_for_everyone()` is either causing or maybe exposing an issue if the other GPUs are already unable to train properly.
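The numbered-checkpoint debugging looks roughly like this (a sketch of the pattern, not the exact prints from the run above):

```python
# Print a marker before and after each synchronization point, so a hang
# between "arrived at 2" and "arrived at 3" points at wait_for_everyone().
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device

print(f"on device {device}: accelerator has {accelerator.num_processes} processes")

print(f"device {device} arrived at 1")
# ... build models / dataloaders here ...
print(f"device {device} arrived at 2")

accelerator.wait_for_everyone()  # if any rank died or diverged earlier, everyone hangs here

print(f"device {device} arrived at 3")  # never printed in the failing run
```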
@LWprogramming i can't tell from first glance; code looks ok from a quick scan
i may be getting back to audio stuff / TTS later this week, so can help with this issue then
are you using Encodec?