accelerate: Training on a single-machine, 4-card RTX 3090 setup runs normally on a single GPU, but errors out when launched across multiple GPUs with accelerate
System Info
- `Accelerate` version: 0.24.1
- Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.31
- Python version: 3.10.13
- Numpy version: 1.26.1
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 251.77 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
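For reference, these settings live in Accelerate's default config file and can be inspected or regenerated from the command line. The snippet below is a sketch using standard Accelerate CLI commands and the default config path, not something taken from the original report:

```bash
# Inspect the saved multi-GPU settings (default location used by Accelerate).
cat ~/.cache/huggingface/accelerate/default_config.yaml

# Re-run the interactive wizard if the distributed settings need to change.
accelerate config
```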
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
- Use the example from https://github.com/hiyouga/LLaMA-Factory/.
- When running on a single GPU, the command is:

```bash
CUDA_VISIBLE_DEVICES=1 python src/train_bash.py \
    --stage pt \
    --model_name_or_path /data/chatglm3-6b/ \
    --do_train \
    --do_eval \
    --dataset wiki_demo \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir output_pt_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1000.0 \
    --plot_loss \
    --report_to tensorboard \
    --fp16
```
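As a side note, on this 4-card machine the single-GPU run masks all but one card via `CUDA_VISIBLE_DEVICES=1`. A quick sanity check that PyTorch really only sees that device (a sketch, not part of the original commands):

```bash
# With the same masking, PyTorch should report exactly one visible device.
CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
```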
- Then launch the same training with accelerate:

```bash
accelerate launch src/train_bash.py \
    --stage pt \
    --model_name_or_path /data/chatglm3-6b/ \
    --do_train \
    --do_eval \
    --dataset wiki_demo \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir output_pt_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1000.0 \
    --plot_loss \
    --report_to tensorboard \
    --fp16
```
It reported `RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)`.
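A couple of checks that might narrow this down (a sketch, not taken from the original report; `nvidia-smi`, `CUDA_LAUNCH_BLOCKING`, and the `accelerate launch` flags shown are standard tools and options, and `CUBLAS_STATUS_NOT_INITIALIZED` at `cublasCreate()` is often an out-of-memory or busy-device symptom rather than a cuBLAS bug):

```bash
# Check free memory and running processes on all four cards before relaunching;
# the error often appears when a process cannot allocate its cuBLAS workspace.
nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv

# Rerun with synchronous kernel launches for a more precise stack trace, and with
# fewer explicit processes/GPUs to see whether the failure scales with the process
# count (training arguments trimmed here; keep them as in the command above).
CUDA_LAUNCH_BLOCKING=1 accelerate launch --multi_gpu --num_processes 2 --gpu_ids 0,1 \
    src/train_bash.py --stage pt --model_name_or_path /data/chatglm3-6b/ --do_train \
    --dataset wiki_demo --template chatglm3 --finetuning_type lora \
    --lora_target query_key_value --output_dir output_pt_dir --fp16
```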
My `accelerate env` output is the same as the System Info shown above.
Expected behavior
The multi-GPU accelerate launch should train the same model without errors, just as the single-GPU command does. How can this be solved?
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 16 (3 by maintainers)
Most upvoted comments
+1
XFR1998 on Nov 30, 2023