transformers: device_map='auto' gives bad results

System Info

  • transformers version: 4.25.1

  • Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.17

  • Python version: 3.8.15

  • Huggingface_hub version: 0.11.1

  • PyTorch version (GPU?): 1.11.0 (True)

  • Tensorflow version (GPU?): not installed (NA)

  • Flax version (CPU?/GPU?/TPU?): not installed (NA)

  • Jax version: not installed

  • JaxLib version: not installed

  • Using GPU in script?: yes

  • Using distributed or parallel set-up in script?: no

  • GPUs: two A100

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Minimal test example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]

print(generated)

Results:

Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy

The above result is not the expected behavior. Without device_map='auto', it works correctly, i.e. the loading line becomes model = AutoModelForCausalLM.from_pretrained(model_name).

Results:

Hello, nice to meet you. How are you?

I’m a bit of a newbie to the world of web development, but I

My machine has two A100 (80 GB) GPUs, and I confirmed that the model is loaded across both GPUs when I use device_map='auto'.
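
For reference, when device_map='auto' is used, the split can be inspected via the hf_device_map attribute that from_pretrained attaches to the model (a minimal sketch, not part of the original report):

# Shows which device each module was dispatched to,
# e.g. {'transformer.wte': 0, ..., 'lm_head': 1}
print(model.hf_device_map)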

Expected behavior

Explained above

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 17 (3 by maintainers)

Most upvoted comments

I solved this problem by disabling ACS in the BIOS. This document might be helpful to some of you: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
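
A quick sanity check for this kind of peer-to-peer problem (a sketch, not from the original thread) is to ask the driver whether peer access is reported and to round-trip a tensor between the two GPUs:

import torch

# Does the driver report peer access between GPU 0 and GPU 1?
print(torch.cuda.can_device_access_peer(0, 1))

# Round-trip a tensor: with broken P2P (e.g. ACS interfering),
# the values can come back corrupted instead of matching.
x = torch.randn(1024, device='cuda:0')
y = x.to('cuda:1').to('cuda:0')
print(torch.equal(x, y))  # True on a healthy setup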

Hello @younesbelkada, I'm using the same accelerate version, 0.15.0. I also get the correct result when I run with export CUDA_VISIBLE_DEVICES=0, but still get wrong results with two GPUs (export CUDA_VISIBLE_DEVICES=0,1).
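
For reference, the same single-GPU restriction can be applied from inside the script by setting the variable before torch initializes CUDA (a minimal sketch equivalent to the export above):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # must be set before CUDA is initialized

import torch
from transformers import AutoModelForCausalLM

# With only one GPU visible, device_map='auto' keeps the whole model on GPU 0.
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-125M', device_map='auto')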

Mmmm, there is no reason for the script to give different results on different GPUs, especially since removing device_map='auto' gives the same results.

I also can't reproduce on my side. Are you absolutely certain your script is launched in the same Python environment you are reporting? E.g., can you print the versions of Accelerate/Transformers/PyTorch in the same script?
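
For example, a minimal snippet to print those versions from inside the same script:

import torch
import transformers
import accelerate

print('transformers:', transformers.__version__)
print('accelerate:', accelerate.__version__)
print('torch:', torch.__version__, '| CUDA:', torch.version.cuda)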

I am slightly unsure about what could be causing the issue, but I suspect it is highly correlated with the fact that you're running your script on two RTX A6000s. @sgugger, do you think the problem could be related to accelerate and to the script running on two RTX A6000s instead of other hardware (i.e., have you seen similar discrepancy errors in the past)? @youngwoo-yoon, could you also try the script with the latest PyTorch version (1.13.1)?