bitsandbytes: NaN error when using a GPU with no support for igemmlt

I get `RuntimeError: probability tensor contains either inf, nan or element < 0` on most language models when trying to run them in 8-bit.

I adapted a script by lorr1 (https://github.com/TimDettmers/bitsandbytes/issues/42#issue-1384920163) into a small test that first runs the model in 8-bit with igemmlt, then disables igemmlt support and runs it again. I tested this on an RTX 3060, and the run without igemmlt raises the RuntimeError. I think there is a bug in the fallback code that replaces igemmlt on older GPUs.
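For reference, here is a minimal sketch of such a test script (my reconstruction, not lorr1's original). It assumes a bitsandbytes version where the kernel choice goes through bitsandbytes.autograd._functions.supports_igemmlt; if your version gates on a different check, patch that one instead.

import bitsandbytes.autograd._functions as bnb_functions
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-410m-deduped"
MAX_NEW_TOKENS = 128

tokenizer = AutoTokenizer.from_pretrained(MODEL)
prompt = ("Q: On average Joe throws 25 punches per minute. "
          "A fight lasts 5 rounds of 3 minutes. How many punches did he throw?\n"
          "A: Let's think step by step.\n")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

def run_8bit(tag):
    model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto", load_in_8bit=True)
    ids = model.generate(input_ids, max_length=len(input_ids[0]) + MAX_NEW_TOKENS, do_sample=True)
    print(tag, tokenizer.decode(ids[0]), sep="\n")

run_8bit("8bit-reg:")  # native igemmlt path on this GPU
# Pretend the GPU has no igemmlt support, forcing the fallback path.
bnb_functions.supports_igemmlt = lambda device: False
run_8bit("8bit-no-igemmlt:")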

Interestingly, it works with some models, such as EleutherAI/pythia-70m-deduped, EleutherAI/gpt-neo-125M, and facebook/opt-6.7b, but most others fail with the RuntimeError. With EleutherAI/pythia-410m-deduped it outputs the following:

» python 8bit_test.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
8bit-reg:
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes.
How many punches did he throw?

A: Let’s think step by step.

First, Joe threw a baseball cap.
Next, he threw a bat in the air.
Joe threw a bat in the air.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Traceback (most recent call last):
  File "/media/veryhighspeed/koboldai/client/8bit_test.py", line 57, in <module>
    generated_ids_8bit = model_8bit.generate(input_ids, max_length=len(input_ids[0]) + MAX_NEW_TOKENS, do_sample=True)
  File "/media/veryhighspeed/koboldai/client/8bit-venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/veryhighspeed/koboldai/client/8bit-venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.sample(
  File "/media/veryhighspeed/koboldai/client/8bit-venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2479, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

@Ph0rk0z in https://github.com/TimDettmers/bitsandbytes/issues/131#issuecomment-1418274961 also ran into this issue.

Most upvoted comments

@0cc4m @Opdoop

I edited /site-packages/bitsandbytes/autograd/_functions.py,

first at line 406:

        else:
            A_wo_outliers = A.clone()
            if state.idx is not None:
                A_wo_outliers[:, state.idx.long()] = 0
            # old order: matmul first, dequantize after -- can overflow fp16
            #output = torch.nn.functional.linear(A_wo_outliers, state.CB.to(A.dtype))
            #output = output.mul_(state.SCB.unsqueeze(0).mul(1.0 / 127.0))
            # new order: dequantize CB first so the matmul inputs stay small
            CB = state.CB.to(A.dtype).mul_(state.SCB.unsqueeze(1).mul(1.0 / 127.0))
            output = torch.nn.functional.linear(A_wo_outliers, CB)
            if bias is not None:
                output = output.add_(bias)

then at line 468:
        if req_gradA:
            if state.CBt is not None:
                C32grad, Sgrad = F.transform(Cgrad, "col32")
                if state.CxBt is None:
                    state.CxBt, state.SBt = F.transform(state.CBt, to_order=formatB, transpose=True)
                gradA32, SgradA32 = F.igemmlt(C32grad, state.CxBt, Sgrad, state.SBt)
                # same fix as above: dequantize CB first, then matmul,
                # instead of the commented-out mm_dequant below
                CB = state.CB.to(ctx.dtype_A, copy=True).mul_(state.SCB.unsqueeze(1).mul(1.0 / 127.0))
                grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
                #grad_A = F.mm_dequant(gradA32, SgradA32, SCgrad, state.SCBt).view(ctx.grad_shape).to(ctx.dtype_A)

and now pythia-12b in 8-bit at threshold 1.5 no longer NaNs on me.

I then switched back to the full 6.0 threshold and ran inference again. It works!

@richardwth you are a hero, you fixed this bug and nobody noticed!

wahoo! https://github.com/TimDettmers/bitsandbytes/pull/335

No, I think it corresponds to the threshold mentioned in the bitsandbytes README, which defaults to 6.0. That explains why it works on older cards with 0.8: at that threshold it barely converts anything to 8-bit anymore.
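If you want to experiment with that knob, here is a minimal sketch of setting the threshold at load time via transformers' BitsAndBytesConfig (assuming a transformers version recent enough to have it; older versions passed load_in_8bit straight to from_pretrained):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# llm_int8_threshold is the outlier threshold discussed above (default 6.0);
# lowering it keeps more columns in fp16 and converts less to int8.
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.8)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-410m-deduped",
    device_map="auto",
    quantization_config=quant_config,
)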

@zhaoqf123 Hi buddy, sorry for the late reply. I did not use any advanced methods like the ones you used here. I manually inserted breakpoints and used torch.isnan, torch.isinf, or torch.isfinite to check which transformer layer, and later which exact line, gave the infinite results.
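For anyone who wants to automate that search, here is a minimal sketch using forward hooks (add_nan_hooks is just an illustrative helper, not part of bitsandbytes or transformers):

import torch

def add_nan_hooks(model):
    # Flag every module whose output contains inf/nan -- the same idea as
    # manual breakpoints with torch.isfinite, but automated.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"non-finite output in {name} ({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))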

I came across a similar problem when finetuning Llama 7B: the hidden states became inf at LlamaMLP (specifically, down_proj). I used a V100 with device capability 7.0, so igemmlt is naturally unsupported. I found that the inf happens at these lines of autograd._functions.MatMul8bitLt:

# (line 390) 3. Matmul, else branch
output = torch.nn.functional.linear(A_wo_outliers, state.CB.to(A.dtype))
output = output.mul_(state.SCB.unsqueeze(0).mul(1.0 / 127.0))

The inf happens because output has some values larger than the fp16 maximum (about 65504) after F.linear; a small numeric demonstration is at the end of this comment.

As I understand it, state.CB ranges between -127 and 127 and is much larger in magnitude than A_wo_outliers (which is bounded by the 6.0 threshold). Wouldn't it be safer to calculate CB first and then do F.linear? That is,

CB = state.CB.to(A.dtype).mul_(state.SCB.unsqueeze(1).mul(1.0 / 127.0))
output = torch.nn.functional.linear(A_wo_outliers, CB)

Is it designed this way to prevent underflow? I also notice that CB is calculated first in the backward pass (line 455):

CB = state.CB.to(ctx.dtype_A, copy=True).mul_(state.SCB.unsqueeze(1).mul(1.0 / 127.0))
grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
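
To make the overflow concrete, here is a small self-contained sketch (synthetic shapes and a hypothetical scale, chosen for illustration; the 6.0 bound and the 1/127 factor mirror the code above):

import torch

# fp16 can represent at most ~65504; anything larger becomes inf on store.
# A is bounded by the 6.0 outlier threshold; CB holds int8 codes (-127..127).
A = torch.full((1, 4096), 6.0)
CB = torch.full((1, 4096), 127.0)
SCB = torch.tensor([0.1])  # hypothetical per-row quantization scale

# Original order: matmul first, dequantize after.
# The matmul result, 6 * 127 * 4096 ~ 3.1e6, overflows fp16.
out_old = (A @ CB.t()).to(torch.float16) * (SCB / 127.0)

# Fixed order: dequantize CB first, so the matmul result is only ~2.5e3.
out_new = (A @ (CB * SCB.unsqueeze(1) / 127.0).t()).to(torch.float16)

print(torch.isinf(out_old).any().item())  # True
print(torch.isinf(out_new).any().item())  # False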

The only fix thus far is to lower the int8 threshold, like they did here: https://gist.github.com/whjms/2505ef082a656e7a80a3f663c16f4277

It's still buggy and a bit slow.