transformers: device_map='auto' gives bad results
System Info
- `transformers` version: 4.25.1
- Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.17
- Python version: 3.8.15
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
- GPUs: two A100
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Minimal test example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    # Tokenize manually and build the input tensor on CPU.
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]
print(generated)
```
Results:

```
Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy
```
The above result is not expected behavior.

Without `device_map='auto'` at line 5, it works correctly. Line 5 then becomes:

```python
model = AutoModelForCausalLM.from_pretrained(model_name)
```
Results:

```
Hello, nice to meet you. How are you?
I’m a bit of a newbie to the world of web development, but I
```
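(For completeness, a sketch that is not from the original report: the model can also be placed explicitly on a single GPU, which sidesteps the multi-GPU dispatch entirely; the `cuda:0` device index here is an assumption.)

```python
# Hypothetical single-GPU variant: load normally, then move the model
# and the input tensor to one device instead of sharding across GPUs.
model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda:0')
gen_tokens = model.generate(tensor_input.to('cuda:0'), max_length=32)
print(tokenizer.batch_decode(gen_tokens)[0])
```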
My machine has two A100 (80 GB) GPUs, and I confirmed that the model is loaded across both GPUs when I use `device_map='auto'`.
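(A minimal sketch of how that placement can be inspected, assuming the reproduction snippet above has already run; `hf_device_map` is the attribute transformers attaches to the model when a `device_map` is used.)

```python
# Shows which device each module was dispatched to, as a dict mapping
# module names to device indices (or 'cpu'/'disk').
print(model.hf_device_map)
```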
Expected behavior
Explained above
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 17 (3 by maintainers)
I solved this problem by disabling ACS in the BIOS. This document might be helpful to some of you: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
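(Not part of the original comment, but a hedged diagnostic sketch related to the ACS fix: when PCIe ACS is enabled, peer-to-peer (P2P) transfers between GPUs can be rerouted and misbehave, which matches the garbage-token symptom above. PyTorch can at least report whether P2P access between the two GPUs is possible.)

```python
import torch

# Check whether each GPU can reach the other via peer-to-peer access.
print(torch.cuda.can_device_access_peer(0, 1))
print(torch.cuda.can_device_access_peer(1, 0))
```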
Hello @younesbelkada, I’m using the same version 0.15.0 of `accelerate`. I also got the correct result when I ran with `export CUDA_VISIBLE_DEVICES=0`. Still wrong results with two GPUs: `export CUDA_VISIBLE_DEVICES=0,1`.
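(A minimal sketch, not from the thread: the same single-GPU restriction can be applied from inside the script, as long as the variable is set before CUDA is initialized.)

```python
import os

# Must be set before the first CUDA call; safest is before importing torch.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
print(torch.cuda.device_count())  # expected: 1
```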
Mmmm, there is no reason for the script to give different results on different GPUs, especially since removing the `device_map="auto"` gives the same results.
I also can’t reproduce on my side. Are you absolutely certain your script is launched in the same Python environment you are reporting? E.g. can you print the versions of Accelerate/Transformers/Pytorch in the same script?
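(A minimal sketch of such a version check, not from the thread:)

```python
# Print the library versions from within the same script/environment.
import torch
import transformers
import accelerate

print('torch:', torch.__version__)
print('transformers:', transformers.__version__)
print('accelerate:', accelerate.__version__)
```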
I am slightly unsure about what could be causing the issue, but I suspect it’s highly correlated with the hardware. @sgugger, do you think the problem could be related to `accelerate` and the fact that the script is running under two RTX A6000 instead of other hardware (i.e. have you seen similar discrepancy errors in the past)? @youngwoo-yoon, could you also try the script with the latest PyTorch version (1.13.1)?