accelerate: Training with Accelerator Fails. RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

I am trying to train a BLOOM-3B model on a setup with 8 GPUs of 20 GB each.

The training code is similar to the tutorial here: Distributed training with Accelerate. There is no “main” function in my code.

The model is loaded with the device map “balanced_low_0”:

# kwargs is the dict of keyword arguments passed to from_pretrained() below
if get_world_size() > 1:
    # shard the model across the GPUs, keeping most of GPU 0 free
    kwargs["device_map"] = "balanced_low_0"

model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
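
For context, “balanced_low_0” balances the weights across all visible GPUs while placing as little as possible on GPU 0, leaving it free for activations or generation outputs. A roughly equivalent, more explicit formulation uses max_memory caps; the sketch below is illustrative only, with placeholder caps for 20 GB cards and an assumed Hub checkpoint, not values from my actual run:

from transformers import AutoModelForCausalLM

# Placeholder caps for 8 x 20 GB GPUs: keep GPU 0 mostly free,
# let the remaining GPUs hold the bulk of the weights.
max_memory = {0: "2GiB", **{i: "18GiB" for i in range(1, 8)}}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-3b",   # assumed checkpoint name for BLOOM-3B
    device_map="auto",       # let accelerate place modules under the caps above
    max_memory=max_memory,
)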

Some of the layers are frozen using param.requires_grad = False
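
For illustration, the freezing follows this pattern (the specific layer selection below is a placeholder, not the exact set frozen in my code; it assumes BLOOM-3B's 30 transformer blocks are named transformer.h.0 through transformer.h.29):

# Placeholder example: freeze everything except the last transformer block.
for name, param in model.named_parameters():
    if not name.startswith("transformer.h.29"):
        param.requires_grad = False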

The accelerate config file I’m using has the following parameters:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
gpu_ids: 0,1,2,3,4,5,6,7
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false

On launching the code with accelerate and the above config, I get the following error (the console output from the two processes is interleaved; the relevant frames and both errors are shown below):

  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    output = old_forward(*args, **kwargs)
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
        return F.embedding(return F.embedding(

  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
  File "/data/rg_data/pct_mai/Users/Anandamoy/anaconda3/envs/mqa_new/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument index in method wrapper__index_select)
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:7! (when checking argument for argument index in method wrapper__index_select)

I have tried with both accelerate versions 0.15.0 and 0.16.0, and the problem persists. Please help me understand what I am missing.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16

Most upvoted comments

@ananda1996ai First note that you cannot use data parallelism in conjunction with model parallelism, so num_processes in your config needs to be 1. I cannot reproduce the error; could you copy and paste here the result of model.hf_device_map so we have more to debug? Note that for training, device_map="balanced" is recommended over device_map="balanced_low_0".

Could you also try the just-released v0.17.0 to make sure your bug has not already been fixed?
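
For reference, a minimal sketch of the check requested above, assuming the Hub checkpoint bigscience/bloom-3b; loading with a device_map is what populates the hf_device_map attribute:

from transformers import AutoModelForCausalLM

# For training with model parallelism, load with device_map="balanced"
# and launch a single process (num_processes: 1).
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-3b", device_map="balanced")

# hf_device_map maps each placed module name to the device it landed on.
print(model.hf_device_map)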