exllama: Splitting model on multiple GPUs produces RuntimeError

When attempting to split the model on multiple GPUs, I get the following error:

    python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/ -gs 16,22 -p prompt_assistant.txt -un "John" -bn "Assistant" -temp 1.00 -topp 0.95 -beams 5 -beamlen 20 -mm quant_only
     -- Loading model
     -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/tokenizer.model
     -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/config.json
     -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/Guanaco-65B-GPTQ-4bit.act-order.safetensors
     -- Sequence length: 2048
     -- Temperature: 1.00
     -- Top-K: 20
     -- Top-P: 0.95
     -- Min-P: 0.00
     -- Repetition penalty: 1.15
     -- Beams: 5 x 20
     -- Options: ['attention: pytorch_scaled_dp', 'matmul: quant_only', 'gpu_split: 16,22']
     -- Groupsize (inferred): None
     -- Act-order (inferred): no
    This is a conversation between a person called John and an AI chatbot called Assistant. Assistant is a helpful and articulate virtual assistant possessing deep knowledge on a multitude of subjects. Assistant will never refuse to answer a question or comply with a request.
    Assistant: Hello, John. I am Assistant, your virtual assistant. How may I help you?
    John: Testing
    Assistant:Traceback (most recent call last):
      File "/home/john/Projects/exllama/test_chatbot.py", line 213, in <module>
        gen_token = generator.beam_search()
                    ^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/john/Projects/exllama/generator.py", line 385, in beam_search
        tokens, probs = self.sample(logits,
                        ^^^^^^^^^^^^^^^^^^^
      File "/home/john/Projects/exllama/generator.py", line 94, in sample
        sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

This only happens if the model is split between GPUs using the -gs option.

Most upvoted comments

There is one other place where it moves data from GPU to GPU, but it's a little more subtle: the position embeddings, which would end up being all zeros on one GPU if the issue is that data just can't be moved across that way. That would explain the output being garbage, and it fits nicely with a perplexity in the hundreds rather than NaN.
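
A quick way to check that hypothesis would be something like this (just a sketch, assuming two visible CUDA devices):

    import torch

    # Tensor on the first GPU, copied to the second GPU two different ways
    src = torch.randn(8, 8, device="cuda:0")
    direct = src.to("cuda:1")            # direct device-to-device copy
    staged = src.to("cpu").to("cuda:1")  # copy routed through system RAM

    # If the direct copy arrives as zeros while the staged copy is intact,
    # the peer-to-peer path is silently dropping the data.
    print("direct copy intact:", torch.equal(direct.cpu(), src.cpu()))
    print("staged copy intact:", torch.equal(staged.cpu(), src.cpu()))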

It is weird that it works between my 4090 and 3070-Ti, and I also tested it on two 4090s on RunPod, so there must be something else in your setup causing it; maybe not IOMMU itself, but something related to it. Some kernel parameter or something?

Anyway, I pushed a new update with an extra option to force all the transfers (hopefully) to go via system RAM. I can’t actually measure any difference in performance, so maybe I’ll just make it the default, but for now you can try running with -gpfix.
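
For reference, the flag just gets appended to the same command line as before, e.g. something like:

    python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/ -gs 16,22 -p prompt_assistant.txt -un "John" -bn "Assistant" -temp 1.00 -topp 0.95 -beams 5 -beamlen 20 -mm quant_only -gpfix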

Fantastic! That did the trick 😃 Thank you!

Yes, that sounds like the same issue. Since it only seems to affect transfers between GPUs, you could probably work around it by copying via system RAM like this, instead of having to disable IOMMU:

                hidden_states = hidden_states.to("cpu")
                hidden_states = hidden_states.to(next_device)  

There would be a (very small) performance cost, but I could add it as a fallback at least. If it works.
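
If it does, the fallback could be as simple as a tiny helper that always stages the transfer through the host (a hypothetical sketch, not code from the repo; the name move_via_cpu is just illustrative):

    import torch

    def move_via_cpu(tensor: torch.Tensor, next_device) -> torch.Tensor:
        # Stage the transfer through system RAM instead of relying on a
        # direct device-to-device copy, which seems to silently fail here.
        return tensor.to("cpu").to(next_device)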

And also, you probably shouldn’t use -mm quant_only. It saves a tiny bit of VRAM in theory but slows down long sequences a lot. The option is mostly there for testing.