audiolm-pytorch: Audio generation failing at FineTransformer

I trained a model back when the repo was at commit 95e0669dde9c177b807fa6f0a52e4d2e685c47fd and successfully got checkpoints, but it crashed when I tried to test the generations. The error was a hard-to-understand CUDA message:

generating fine:   0%|          | 0/512 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [480,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [480,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
... many repetitions ...
File "/fsx/itsleonwu/audiolm-pytorch-training/audiolm_pytorch/audiolm_pytorch.py", line 1617, in generate
    _, fine_logits = self.transformer.forward_with_cond_scale(
... more stuff ...
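For anyone who hits the same opaque output: this is the generic device-side assert CUDA raises whenever an index tensor holds values outside the range of the table being indexed. A minimal sketch (hypothetical, not the repo's code) that reproduces the same class of failure:

```python
# Hypothetical repro: an out-of-range index (e.g. -1) fed into an nn.Embedding
# trips the same "srcIndex < srcSelectDimSize" assertion on CUDA.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
emb = nn.Embedding(num_embeddings=1024, embedding_dim=32).to(device)
tokens = torch.tensor([[5, 10, -1]], device=device)  # -1 is out of range

try:
    emb(tokens)
    if device == "cuda":
        torch.cuda.synchronize()  # surface the asynchronous device-side assert
except (IndexError, RuntimeError) as err:
    # On CPU this is a readable "index out of range in self"; on CUDA it is the
    # opaque Indexing.cu assertion shown above. Running with CUDA_LAUNCH_BLOCKING=1
    # also makes the failing call show up in the traceback.
    print(err)
```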

I suspect the problem is a bug in the coarse transformer's eos handling, because generation crashes right as the fine transformer is about to start. I printed the intermediate state and found a -1 in the coarse token ids, which I think is the result of applying mask_out_after_eos_id. However, the first -1 appeared at (0, 121, 2), i.e. timestep 121, quantizer 2, which is partway through a quantizer step; I'd expect the first -1 to show up at the start of one, somewhere like (batch_index, timestep, 0). That seems consistent with the CUDA error: a -1 where the kernel expects small non-negative indices could easily cause an out-of-bounds access.
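For reference, the check I ran was along these lines (just a sketch; coarse_token_ids is assumed to be shaped (batch, timesteps, num_coarse_quantizers) after unflattening):

```python
# Sketch of the debugging check: find the first masked (-1) coarse token and
# see whether it lands at the start of a quantizer step.
import torch

def first_masked_position(coarse_token_ids: torch.Tensor):
    """Return (batch, timestep, quantizer) of the first -1, or None if absent."""
    positions = (coarse_token_ids == -1).nonzero(as_tuple=False)
    return None if positions.numel() == 0 else tuple(positions[0].tolist())

# Expected: a quantizer index of 0, i.e. the mask begins at a full quantizer
# step. What I actually saw was (0, 121, 2), i.e. partway through a step.
```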

Going to use this issue to track any updates and what I’ve tried-- will be using the script in https://github.com/LWprogramming/audiolm-pytorch-training/blob/main/audiolm_pytorch_demo_laion.py (which I set up to eliminate non-determinism).

Most upvoted comments

Hm, it seems to work now when I try a different dataset. I'd originally trained on a tiny dataset (intentionally overfitting to see if it could) with samples trimmed to exactly data_max_length, and that's when the unaligned eos started showing up. It still does on that data, but I can just use input data that's a bit longer and that should probably be fine.

edit: hang on, I didn’t do the trimming properly. Now I’m not sure what’s causing the issue 🙃
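(For clarity, by "trimmed to exactly data_max_length" I mean something like the helper below; this is a sketch, not the actual dataset code.)

```python
import torch
import torch.nn.functional as F

def trim_or_pad(wave: torch.Tensor, data_max_length: int) -> torch.Tensor:
    """Force a waveform to exactly data_max_length samples: crop if long, zero-pad if short."""
    if wave.size(-1) >= data_max_length:
        return wave[..., :data_max_length]
    return F.pad(wave, (0, data_max_length - wave.size(-1)))
```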

should be resolved, feel free to reopen if any new error pops up!

@LWprogramming that’s true! well, it wouldn’t hurt to keep it in there for now 😄

thanks, it was nice!

Right, everything after eos should disappear based on that masking logic, although I'm a bit confused about the relation between this masking and the fine transformer logic you implemented in the FineTransformer change. I think the original issue was that there was an eos token in coarse_token_ids even though it should've been masked out by the code you link. That only shows up when we actually try to use these coarse_token_ids in FineTransformer's generate(), which (iiuc) expects the eos to have already been masked out.

Oh, in the process of writing this I think I get what you did here? If the FineTransformerWrapper previously had logic to avoid doing anything with eos coarse tokens that weren't properly masked, you moved that into FineTransformer so it always applies. But if the eos should already be masked out in CoarseTransformer, I'm not sure why we see eos by the time we get to FineTransformer anyway.
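My mental model of the masking in question, as a rough sketch (not the repo's exact mask_out_after_eos_id):

```python
import torch

def mask_out_after_eos(token_ids: torch.Tensor, eos_id: int, mask_value: int = -1) -> torch.Tensor:
    """Replace the eos position and everything after it with mask_value, per batch row.

    token_ids: (batch, seq_len) of flattened coarse tokens.
    """
    after_eos = (token_ids == eos_id).long().cumsum(dim=-1) > 0  # True at eos and everything after
    return token_ids.masked_fill(after_eos, mask_value)
```

If this runs on the coarse transformer's output as intended, FineTransformer should only ever see valid coarse indices (or -1s), never a raw eos id.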

hope the trip was nice! 😃

Just submitted the job (pending, so no results yet), but while we wait, to check my understanding: this change masks out anything that isn't an actual coarse index, so the transformer doesn't learn anything that relies on those special tokens. But why does that prevent eos from appearing in the wrong spot (i.e. not aligned with the end of a quantizer step)? Or is the goal just to make that very unlikely, because during training attention never sees eos?
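To spell out what I think the change does (the ignore_index mechanism below is my guess at the implementation, not a quote of the diff):

```python
from typing import Iterable

import torch
import torch.nn.functional as F

def coarse_loss(logits: torch.Tensor, targets: torch.Tensor, special_ids: Iterable[int]) -> torch.Tensor:
    """Cross-entropy over coarse tokens that skips any special (eos/pad) target positions.

    logits: (batch, seq_len, codebook_size); targets: (batch, seq_len) of token ids.
    """
    targets = targets.clone()
    for special_id in special_ids:
        targets[targets == special_id] = -100  # positions F.cross_entropy will ignore
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```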

Here is the script I'm running on a small dataset, but you can use an artificial one by uncommenting make_placeholder_dataset(), switching to dataset_folder = f"{prefix}/placeholder_dataset", and setting all the train steps etc. very low so you hit the error quickly.
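Roughly what I mean by a placeholder dataset (just a sketch of the idea, not necessarily what make_placeholder_dataset in the script does): a handful of short random clips is enough to hit the error path quickly.

```python
# Hypothetical placeholder dataset: a few short random wav files to overfit on.
from pathlib import Path

import torch
import torchaudio

def make_placeholder_dataset(folder: str, num_files: int = 4,
                             sample_rate: int = 24000, seconds: float = 1.0) -> None:
    Path(folder).mkdir(parents=True, exist_ok=True)
    for i in range(num_files):
        wave = torch.randn(1, int(sample_rate * seconds)).clamp(-1.0, 1.0)
        torchaudio.save(f"{folder}/placeholder_{i}.wav", wave, sample_rate)
```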

And I confirmed that the eos is the problem: the assertion here triggered when my job ran last night.
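The check is essentially a range assertion on the coarse tokens right before the fine transformer consumes them; a sketch (not the exact assertion in the repo):

```python
import torch

def assert_valid_coarse_tokens(coarse_token_ids: torch.Tensor, codebook_size: int) -> None:
    """Fail loudly if any eos / masked (-1) values survive into fine generation."""
    assert ((coarse_token_ids >= 0) & (coarse_token_ids < codebook_size)).all(), \
        "coarse_token_ids contain out-of-range values (unmasked eos or -1)"
```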