accelerate: On a single machine with four RTX 3090 cards, training runs normally on a single GPU, but errors out as soon as multi-GPU training is enabled

System Info

- `Accelerate` version: 0.24.1
- Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.31
- Python version: 3.10.13
- Numpy version: 1.26.1
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 251.77 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: fp16
	- use_cpu: False
	- debug: False
	- num_processes: 4
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: all
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []
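
To confirm that this is the config `accelerate launch` will actually pick up, Accelerate's own CLI can print and exercise it. The two commands below are a suggested sanity check, not something from the original report:

accelerate env    # prints the environment and default config, as shown above
accelerate test   # runs Accelerate's built-in sanity-check script against this config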

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. Use the sample project from https://github.com/hiyouga/LLaMA-Factory/.
  2. When running on a single card, the command is:
CUDA_VISIBLE_DEVICES=1 python src/train_bash.py \
    --stage pt \
    --model_name_or_path /data/chatglm3-6b/ \
    --do_train \
    --do_eval \
    --dataset wiki_demo \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir output_pt_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1000.0 \
    --plot_loss \
    --report_to tensorboard \
    --fp16
  3. Then launch the same training through Accelerate:
accelerate launch src/train_bash.py \
    --stage pt \
    --model_name_or_path /data/chatglm3-6b/ \
    --do_train \
    --do_eval \
    --dataset wiki_demo \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir output_pt_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1000.0 \
    --plot_loss \
    --report_to tensorboard \
    --fp16

It reported `RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)`.
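
CUBLAS_STATUS_NOT_INITIALIZED from cublasCreate usually means a process could not get a usable CUDA context or enough free memory, rather than a cuBLAS bug per se. A minimal way to narrow it down (these commands are suggestions, not from the original report):

# The single-card run pinned GPU 1, so another job may already be occupying
# GPU 0; confirm that all four cards are idle before the multi-GPU launch.
nvidia-smi

# Make CUDA kernel launches synchronous so the traceback points at the real
# failing call instead of a later, unrelated one; then re-run the
# `accelerate launch` command above unchanged.
export CUDA_LAUNCH_BLOCKING=1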

My `accelerate env` output is identical to the System Info above.

Expected behavior

Multi-GPU training should launch and run just like the single-GPU command does. How can this be solved?

Most upvoted comments

No, I think this issue is a bug in Accelerate. It may be that Accelerate only supports NVLink, while these RTX 3090 cards are connected over plain PCIe (shown as PIX in the topology matrix). You can check how your cards are connected with `nvidia-smi topo -m`.
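
For reference, here is how to check the interconnect the commenter mentions, plus a commonly suggested workaround for PCIe-only multi-GPU machines. Disabling P2P is an assumption, not a confirmed fix for this exact issue:

# Links between GPU pairs show as NV# for NVLink; PIX, PXB, PHB, or SYS all
# mean the traffic goes over PCIe without NVLink.
nvidia-smi topo -m

# If NCCL peer-to-peer transfers over PCIe are what fails, disabling them is
# a widely used workaround on consumer cards; re-run the same
# `accelerate launch` command afterwards.
export NCCL_P2P_DISABLE=1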