accelerate: On a single machine with four RTX 3090 cards, training runs normally on a single GPU, but errors out as soon as multi-GPU training is enabled

System Info

- `Accelerate` version: 0.24.1
- Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.31
- Python version: 3.10.13
- Numpy version: 1.26.1
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 251.77 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: fp16
	- use_cpu: False
	- debug: False
	- num_processes: 4
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: all
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []
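
To confirm that this is the config `accelerate launch` will actually pick up, Accelerate's own CLI can print and exercise it. The two commands below are a suggested sanity check, not something from the original report:

accelerate env    # prints the environment and default config, as shown above
accelerate test   # runs Accelerate's built-in sanity-check script against this config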

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. Use the sample project from https://github.com/hiyouga/LLaMA-Factory/.
  2. When running on a single card, the command is:
CUDA_VISIBLE_DEVICES=1 python src/train_bash.py \
    --stage pt \
    --model_name_or_path /data/chatglm3-6b/ \
    --do_train \
    --do_eval \
    --dataset wiki_demo \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir output_pt_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1000.0 \
    --plot_loss \
    --report_to tensorboard \
    --fp16
  3. Then launch the same training through Accelerate:
accelerate launch src/train_bash.py \
    --stage pt \
    --model_name_or_path /data/chatglm3-6b/ \
    --do_train \
    --do_eval \
    --dataset wiki_demo \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir output_pt_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1000.0 \
    --plot_loss \
    --report_to tensorboard \
    --fp16

It reported `RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)`.
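
CUBLAS_STATUS_NOT_INITIALIZED from cublasCreate usually means a process could not get a usable CUDA context or enough free memory, rather than a cuBLAS bug per se. A minimal way to narrow it down (these commands are suggestions, not from the original report):

# The single-card run pinned GPU 1, so another job may already be occupying
# GPU 0; confirm that all four cards are idle before the multi-GPU launch.
nvidia-smi

# Make CUDA kernel launches synchronous so the traceback points at the real
# failing call instead of a later, unrelated one; then re-run the
# `accelerate launch` command above unchanged.
export CUDA_LAUNCH_BLOCKING=1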

My `accelerate env` output is identical to the System Info above.

Expected behavior

Multi-GPU training should launch and run just like the single-GPU command does. How can this be solved?

Most upvoted comments

No, I think this issue is a bug in Accelerate. It may be that Accelerate only supports NVLink, while these RTX 3090 cards are connected over plain PCIe (shown as PIX in the topology matrix). You can check how your cards are connected with `nvidia-smi topo -m`.
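
For reference, here is how to check the interconnect the commenter mentions, plus a commonly suggested workaround for PCIe-only multi-GPU machines. Disabling P2P is an assumption, not a confirmed fix for this exact issue:

# Links between GPU pairs show as NV# for NVLink; PIX, PXB, PHB, or SYS all
# mean the traffic goes over PCIe without NVLink.
nvidia-smi topo -m

# If NCCL peer-to-peer transfers over PCIe are what fails, disabling them is
# a widely used workaround on consumer cards; re-run the same
# `accelerate launch` command afterwards.
export NCCL_P2P_DISABLE=1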