DeepSpeed: [BUG] I have been trying to run DeepSpeed on a 32 GB Tesla V100 GPU
**Describe the bug**
I have been trying to run DeepSpeed on a 32 GB Tesla V100 GPU, but it still does not work. I also tried parallelizing it over 4 GPUs, and the process is killed with SIGKILL.
**To Reproduce**
Here is the code I ran:
```python
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
```
**Docker context**
Are you using a specific docker image that you can share?
**Additional context**
Add any other context about the problem here.
About this issue
- State: open
- Created a year ago
- Comments: 15 (5 by maintainers)
@AbhayGoyal I was facing the same issue on a V100. In my case the process crashed with SIGKILL when I ran out of system RAM. The reason is that the model is first loaded on the CPU and then moved to the GPU by DeepSpeed, so if you run the script with more than one GPU, DeepSpeed loads multiple instances of the model and may exhaust system memory. Can you check the amount of RAM (system RAM, not GPU RAM) available? Run the inference script and monitor the RAM with `free -s2 -g`.
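If it helps, here is a minimal sketch of doing that check from inside the script instead of a separate terminal. It assumes `psutil` is installed (`pip install psutil`); the helper name `log_ram` and the 2-second interval are just placeholders:

```python
# Sketch only: periodically print available system RAM so you can watch it
# drop while the checkpoint is loaded on the CPU.
import threading
import time

import psutil  # assumption: psutil is installed


def log_ram(interval_s: float = 2.0) -> None:
    """Print available system RAM in GiB every `interval_s` seconds."""
    while True:
        avail_gib = psutil.virtual_memory().available / (1024 ** 3)
        print(f"available system RAM: {avail_gib:.1f} GiB", flush=True)
        time.sleep(interval_s)


# Start the monitor as a daemon thread before building the pipeline /
# calling deepspeed.init_inference, so it keeps logging during model load.
threading.Thread(target=log_ram, daemon=True).start()
```

If the printed value drops toward zero right before the crash, the SIGKILL is coming from the kernel OOM killer, which matches the multiple-instances explanation above.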