audiolm-pytorch: Accelerate failing on multi-gpu rng synchronization

Semantic transformer training works for me right now, but coarse transformer training doesn't. Here's what the error message looks like:

File "/path/to/trainer.py", line 999, in train_step
data_kwargs = dict(zip(self.ds_fields, next(self.dl_iter)))
File "/path/to/trainer.py", line 78, in cycle
for data in dl:
File "/path/to/venv/site-packages/accelerate/data_loader.py", line 367, in iter
synchronize_rng_states(self.rng_types, self.synchronized_generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 100, in synchronize_rng_states
synchronize_rng_state(RNGType(rng_type), generator=generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 95, in synchronize_rng_state
generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state

This happens in trainer.py. I don't think the coarse dataloaders are constructed any differently from the semantic ones, so I'm confused about whether this is expected (it also wasn't clear to me what the generator is versus the rng types like "cuda"). Do you have any ideas why this might fail only for the coarse transformer but not the semantic one?
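For anyone else hitting this: as far as I can tell, synchronize_rng_state boils down to taking the main process's torch generator state and applying it on every other process with set_state, and set_state rejects any state tensor whose contents don't look like a valid mt19937 state. A minimal sketch of that round trip (not the actual accelerate code; the all-zeros state is just my guess at one way to trip the same check):

    import torch

    # Roughly what synchronize_rng_state does: snapshot the main process's
    # generator state and apply it everywhere else.
    gen = torch.Generator()
    state = gen.get_state()        # a ByteTensor snapshot of the mt19937 state
    gen.set_state(state)           # round-trips cleanly when the state is intact

    # set_state validates the contents of the tensor it receives. A state that
    # is the right size but garbled (here, all zeros) should hit the same
    # "Invalid mt19937 state" check as in the traceback above.
    try:
        gen.set_state(torch.zeros_like(state))
    except RuntimeError as err:
        print(err)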

I found this issue with the same error message, but unfortunately it never got resolved, and I didn't find any other similar issues besides that one.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 15 (15 by maintainers)

Most upvoted comments

haha yea, we are still in the mainframe days of deep learning. A century from now, maybe it won’t even matter

Ahh ok! I’ll have to rewrite some of my code haha

(for anyone looking at this in the future: I just talked to a friend of mine, and they pointed out that training multiple models in parallel either requires moving parameters on and off the GPU far more often, or, if the models are small enough to all fit in memory at once, the batch size necessarily has to shrink. I guess I still don't know exactly what caused things to break, but it doesn't matter so much now.)
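Very roughly, the trade-off in code; train_phase, model, and optimizer here are hypothetical placeholders, not anything from audiolm-pytorch:

    # Option 1: interleave several models on one GPU by shuttling each model's
    # parameters on and off the device for its training phase.
    def train_phase(model, optimizer, batches, device="cuda"):
        model.to(device)                    # params onto the GPU for this phase
        for batch in batches:
            loss = model(batch.to(device))  # assumes the model returns a loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        model.to("cpu")                     # ...then back off to free memory

    # Option 2: keep every model resident on the GPU at once, which only works
    # if they're small enough, and the memory they occupy forces a smaller
    # batch size for the activations.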

Thanks so much!

i can add the error message later today! this is a common gotcha, which i handled before over at imagen-pytorch (which also trains multiple networks)

@LWprogramming this is just how neural network training is generally done today, if you have multiple big networks to train

ohh! yeah that’s the issue then

you can only train one network per training script
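(One hypothetical way to restructure along those lines: launch each stage as its own accelerate script, so each process only ever prepares one model and one dataloader. The script names below are placeholders, not files that exist in this repo.)

    import subprocess

    # Placeholder per-stage entry points; each script builds and trains exactly
    # one transformer with its own Accelerator.
    STAGES = ["train_semantic.py", "train_coarse.py", "train_fine.py"]

    for script in STAGES:
        # `accelerate launch` is the usual multi-GPU entry point for each stage
        subprocess.run(["accelerate", "launch", script], check=True)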

Yeah, the setup is something like this (given some configurable integer save_every):

Train the semantic transformer for save_every steps, then the coarse transformer for save_every steps, then the fine transformer. Then try sampling, run another save_every steps per trainer, and repeat. This way we can see how the samples evolve as the transformers gradually train.
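In pseudo-Python the schedule looks roughly like this; train_steps and sample_fn are placeholders, not the actual audiolm-pytorch trainer API:

    # Rough sketch of the schedule described above; `trainer.train_step()` is an
    # assumed single-step method, not necessarily the real interface.
    def train_steps(trainer, n):
        for _ in range(n):
            trainer.train_step()

    def training_loop(semantic_trainer, coarse_trainer, fine_trainer,
                      sample_fn, save_every, num_rounds):
        for _ in range(num_rounds):
            train_steps(semantic_trainer, save_every)
            train_steps(coarse_trainer, save_every)
            train_steps(fine_trainer, save_every)
            sample_fn()   # listen to intermediate samples after each round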

i don’t really know, but probably good to rule out an external library as the issue

will get back to this either end of this week or next Monday. going all out on audio again soon

Yeah, using Encodec. Do you suspect that the codec might be the issue somehow?

I also noticed (after adding some more prints) some weird behavior:

  • all of the GPUs make it to the on device {device}: accelerator has... print
  • only the main GPU makes it to the device {device} arrived at 2 print
  • the main GPU crashes at the wait_for_everyone() shortly after point 2, so it never arrives at 3. That makes it seem like wait_for_everyone() is either causing the problem or just exposing one that's already there, if the other GPUs can no longer train properly (the prints are sketched below).
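For reference, the prints were just checkpoint markers around the barrier, roughly like the sketch below; the message strings are paraphrased:

    from accelerate import Accelerator

    accelerator = Accelerator()
    device = accelerator.device

    # Each rank logs before and after the barrier, so a crash or hang shows
    # exactly which ranks diverged.
    print(f"on device {device}: arrived at 2", flush=True)
    accelerator.wait_for_everyone()   # collective barrier; every rank must reach it
    print(f"on device {device}: arrived at 3", flush=True)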

@LWprogramming i can’t tell from first glance; code looks ok from a quick scan

i may be getting back to audio stuff / TTS later this week, so can help with this issue then

are you using Encodec?