FastChat: Fine-tuning Vicuna-7B with Local GPUs: RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false

RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

0%| | 0/3096 [00:00<?, ?it/s] use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False
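
For context (a hedged note, not from the thread): sm80 and sm90 are CUDA compute capabilities 8.0 (A100/A800) and 9.0 (H100/H800), which is why the same script runs on an A100 but trips this assert on consumer GPUs. A quick way to see what your card reports, using only standard PyTorch calls:

    import torch

    # Print each visible GPU's compute capability; the failing kernel asserts
    # sm80 (capability 8.0, e.g. A100) or sm90 (9.0, e.g. H100).
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"cuda:{i} {torch.cuda.get_device_name(i)} -> sm{major}{minor}")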

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 24 (3 by maintainers)

Most upvoted comments

I'm facing a similar issue:

Expected q_dtype == torch::kFloat16 || ((is_sm8x || is_sm90) && q_dtype == torch::kBFloat16) to be true, but got false

@zhisbug If the official setup can run with flash-attention on an A100 (the same hardware), maybe providing an environment spec with the exact package versions would help others resolve the confusion.
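
As a hedged illustration (not from the thread) of what that q_dtype assert is checking: the q/k/v tensors reaching the flash-attention kernel must be fp16, or bf16 on sm8x/sm90 GPUs, so fp32 inputs from a model loaded in full precision will trip it. A minimal Python rendering of the same condition:

    import torch

    # Illustrative shapes only (batch, seqlen, heads, head_dim); the dtype is what matters here.
    q = torch.randn(2, 1024, 32, 128, dtype=torch.float32)
    on_sm8x_or_sm90 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
    accepted = q.dtype == torch.float16 or (on_sm8x_or_sm90 and q.dtype == torch.bfloat16)
    print("kernel would accept q:", accepted)  # False for fp32; cast with q.half() or q.bfloat16()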

I’m closing this issue because this seems to be a flash attention issue.

We'll soon migrate to xformers (https://github.com/facebookresearch/xformers) in place of flash-attention, as our internal tests show they have similar memory/compute performance, but xformers is much more stable, is maintained by Meta, supports more types of GPUs, and is more extensible.
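
For readers unfamiliar with it, here is a minimal sketch of the xformers memory-efficient attention API (assumes xformers is installed; shapes and dtype are illustrative, and this is not FastChat's actual integration):

    import torch
    import xformers.ops as xops

    # Memory-efficient attention over (batch, seqlen, heads, head_dim) tensors;
    # xformers picks a backend suited to the local GPU.
    q = torch.randn(2, 1024, 32, 128, device="cuda", dtype=torch.float16)
    k = torch.randn(2, 1024, 32, 128, device="cuda", dtype=torch.float16)
    v = torch.randn(2, 1024, 32, 128, device="cuda", dtype=torch.float16)
    out = xops.memory_efficient_attention(q, k, v)  # output has the same shape as q
    print(out.shape)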

Are you sure? flash-attn v2 supports head dimensions up to 256, and I am able to use it on a 3090.

FlashAttention-2 currently supports:

  1. Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
  2. Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
  3. All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800.

Try #2126

Yes, thank you. I was previously using flash-attention 1.x; 2.0 supports it now.
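
If it helps anyone reading along, here is a hedged sketch of calling FlashAttention-2 directly (assumes flash-attn >= 2.0 on an Ampere/Ada/Hopper GPU; shapes and the causal flag are illustrative):

    import torch
    from flash_attn import flash_attn_func

    # flash-attn 2.x expects (batch, seqlen, nheads, head_dim) tensors in fp16/bf16;
    # head dimensions up to 256 are supported, as the quoted README notes.
    q = torch.randn(2, 1024, 32, 128, device="cuda", dtype=torch.float16)
    k = torch.randn(2, 1024, 32, 128, device="cuda", dtype=torch.float16)
    v = torch.randn(2, 1024, 32, 128, device="cuda", dtype=torch.float16)
    out = flash_attn_func(q, k, v, causal=True)  # output has the same shape as q
    print(out.shape)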

If I replace bf16 True with fp16 True in the script args and also add "fp16": {"enabled": true} to my deepspeed config, the error changes to RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. The relevant part of the traceback is:

Traceback (most recent call last):
  /root/FastChat/fastchat/train/train_lora_llama.py:163 in <module>
    train()
  /root/FastChat/fastchat/train/train_lora_llama.py:153 in train
    trainer.train()
  /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/transformers/trainer.py:1662 in train
    return inner_training_loop(
  /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/transformers/trainer.py:1929 in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/transformers/trainer.py:2715 in training_step
    loss = self.deepspeed.backward(loss)
  /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/utils/nvtx.py:15 in wrapped_fn
    ret_val = func(*args, **kwargs)
  /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1796 in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1890 in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py:62 in backward
    scaled_loss.backward(retain_graph=retain_graph)
  /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/torch/_tensor.py:487 in backward
    torch.autograd.backward(
  /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in backward
    Variable._execution_engine.run_backward(
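
For reference, the config change described in that comment amounts to switching the mixed-precision block of the DeepSpeed config from bf16 to fp16. A minimal sketch as a Python dict (DeepSpeed also accepts a dict in place of a JSON file; the ZeRO and batch-size fields below are placeholders, not FastChat's shipped settings):

    # Hypothetical DeepSpeed config fragment, for illustration only.
    ds_config = {
        "fp16": {"enabled": True},          # was: "bf16": {"enabled": True}
        "zero_optimization": {"stage": 2},  # placeholder ZeRO stage
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }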