bitsandbytes: NaN error when using a GPU with no support for igemmlt
I get `RuntimeError: probability tensor contains either inf, nan or element < 0` on most language models when trying to run them in 8-bit.
I adapted a script made by lorr1 (https://github.com/TimDettmers/bitsandbytes/issues/42#issue-1384920163) into a small script that first runs the model in 8-bit with igemmlt, then disables igemmlt support and runs it again. I tested this on an RTX 3060, and the result is the RuntimeError when running without igemmlt. I think there is a bug in the fallback code that replaces igemmlt on older GPUs.
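Roughly, the script does the following (a condensed sketch; the `force_no_igemmlt` flag I flip here is an assumption about the bitsandbytes version, lorr1's original script disables igemmlt by patching the library instead):

```python
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-410m-deduped"
MAX_NEW_TOKENS = 50

tokenizer = AutoTokenizer.from_pretrained(MODEL)
prompt = (
    "Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes.\n"
    "How many punches did he throw?\n"
    "A: Let's think step by step."
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

def run(label, force_no_igemmlt):
    # Reload each time so the 8-bit quantization state is rebuilt for the chosen path
    model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto", load_in_8bit=True)
    if force_no_igemmlt:
        # Force the fallback matmul that GPUs without igemmlt support use
        # (assumed flag; older scripts monkey-patched the capability check instead)
        for m in model.modules():
            if isinstance(m, bnb.nn.Linear8bitLt):
                m.state.force_no_igemmlt = True
    out = model.generate(input_ids, max_length=len(input_ids[0]) + MAX_NEW_TOKENS, do_sample=True)
    print(label)
    print(tokenizer.decode(out[0]))

run("8bit-reg:", False)
run("8bit-no-igemmlt:", True)  # fails with the RuntimeError on most models
```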
Interestingly, it works on some models, like EleutherAI/pythia-70m-deduped, EleutherAI/gpt-neo-125M, facebook/opt-6.7b, but on most others it fails with the RuntimeError. When run with EleutherAI/pythia-410m-deduped it outputs the following:
```
» python 8bit_test.py
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
8bit-reg:
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes.
How many punches did he throw?
A: Let’s think step by step.
First, Joe threw a baseball cap.
Next, he threw a bat in the air.
Joe threw a bat in the air.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Traceback (most recent call last):
  File "/media/veryhighspeed/koboldai/client/8bit_test.py", line 57, in <module>
    generated_ids_8bit = model_8bit.generate(input_ids, max_length=len(input_ids[0]) + MAX_NEW_TOKENS, do_sample=True)
  File "/media/veryhighspeed/koboldai/client/8bit-venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/veryhighspeed/koboldai/client/8bit-venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.sample(
  File "/media/veryhighspeed/koboldai/client/8bit-venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2479, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
@Ph0rk0z in https://github.com/TimDettmers/bitsandbytes/issues/131#issuecomment-1418274961 also ran into this issue.
@0cc4m @Opdoop
I edited /site-packages/bitsandbytes/autograd/_functions.py, first at line 406, then at line 468, and now pythia-12b in 8-bit at a 1.5 threshold no longer NaNs on me.
I then switched back to the full 6.0 threshold and ran inference again. It works!
@richardwth you are a hero, you fixed this bug and nobody noticed!
wahoo! https://github.com/TimDettmers/bitsandbytes/pull/335
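For anyone wondering whether their card takes the igemmlt path at all, here is a quick check (the 7.5 cutoff is my understanding of the library's rule, not something stated in this thread):

```python
import torch

major, minor = torch.cuda.get_device_capability()
# bitsandbytes only uses the int8 igemmlt kernels on compute capability 7.5+
# (Turing and newer, minus a few Turing cards without int8 tensor cores);
# anything older, like the V100 (7.0), goes through the fallback matmul.
print(f"compute capability: {major}.{minor}")
print("igemmlt path available:", (major, minor) >= (7, 5))
```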
No, I think it corresponds to the `threshold` mentioned in the bitsandbytes README, which defaults to `6.0`. That explains why it works on older cards with 0.8: it doesn't convert much to 8-bit anymore.
@zhaoqf123 Hi buddy, sorry for the late reply. I did not use any advanced methods like the ones you used here. I manually inserted breakpoints and used `torch.isnan`, `torch.isinf`, or `torch.isfinite` to check which transformer layer, and later which exact line, gave the infinite results.
I came across a similar problem when finetuning Llama 7B: the hidden states became inf at LlamaMLP (specifically, down_proj). I used a V100 with device capability 7.0, so igemmlt is naturally not supported. Then I found the `inf` happens at this line of `autograd._functions.MatMul8bitLt`. The `inf` happens because the output has some values larger than 65536 at `F.linear`.
As I understand it, state.CB ranges between -127 and 127 and is relatively large compared to A_wo_outliers (which is confined by the 6.0 threshold). Wouldn't it be safer to calculate CB first, i.e. fold the scales into it, and then do `F.linear`? Is the current order designed to prevent underflow? I also notice that CB is calculated first in the backward pass (line 455).
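To make the ordering concrete, here is a toy sketch of the two variants (names and shapes mimic how I read `MatMul8bitLt`'s fallback; this is a reconstruction, not the library code):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the fallback path's operands:
#   A_wo_outliers: fp16 activations with outlier columns removed (|x| <= 6.0)
#   CB:  int8-quantized weights stored as fp16 values in [-127, 127]
#   SCB: per-output-row absmax scales of the original weights
batch, in_f, out_f = 4, 4096, 4096
A_wo_outliers = torch.randn(batch, in_f, dtype=torch.float16, device="cuda")
CB = torch.randint(-127, 128, (out_f, in_f), device="cuda").half()
SCB = (torch.rand(out_f, device="cuda") * 2).half()

# Rescale-after ordering: the matmul produces values on the +/-127 scale, so
# with real LLM activations the fp16 result can pass 65504 (fp16 max) and
# become inf before the SCB/127 rescale ever shrinks it back.
out_late = F.linear(A_wo_outliers, CB) * (SCB.unsqueeze(0) / 127.0)

# Rescale-first ordering (what this comment proposes, and what PR #335 does
# as I read it): fold SCB/127 into CB so the matmul runs on weight-scale
# values and stays inside fp16 range.
out_early = F.linear(A_wo_outliers, CB * (SCB.unsqueeze(1) / 127.0))

# Mathematically identical; they differ only in where fp16 can overflow.
print(torch.allclose(out_late, out_early, rtol=1e-2, atol=1.0))
```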
The only fix thus far is to lower the threshold for int8, like they did here: https://gist.github.com/whjms/2505ef082a656e7a80a3f663c16f4277
It's still buggy and a bit slow.
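If your transformers version exposes `BitsAndBytesConfig`, the same threshold can be lowered without editing any files (a sketch of the workaround, with the 1.5 value borrowed from the comment above):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# llm_int8_threshold defaults to 6.0; lowering it treats more activation
# columns as fp16 "outliers", so less of the computation goes through the
# overflow-prone int8 fallback. At very low values (e.g. 0.8) hardly
# anything is converted to 8-bit at all.
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=1.5)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-410m-deduped",
    device_map="auto",
    quantization_config=quant_config,
)
```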