DeepSpeed: [BUG] Error "exits with return code -7" when finetuning FLANT5-xxl on 8x A100

1. Bug Description

I am finetuning Flan-T5-xxl on my own corpus with DeepSpeed, following the tutorial. When I run 'deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py', the processes are terminated immediately after all GPUs finish loading the checkpoint shards: the run exits with return code -7 and produces no further error traceback.

Here is the full console output:

$ deepspeed --num_gpus=8 run_seq2seq_deepspeed.py
[2023-02-24 18:10:18,983] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-24 18:10:19,049] [INFO] [runner.py:548:main] cmd = /home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_seq2seq_deepspeed.py
[2023-02-24 18:10:22,043] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-02-24 18:10:22,043] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-02-24 18:10:22,043] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-02-24 18:10:22,043] [INFO] [launch.py:162:main] dist_world_size=8
[2023-02-24 18:10:22,043] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:34<00:00,  6.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:37<00:00,  7.49s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:40<00:00,  8.11s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:41<00:00,  8.37s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00,  8.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00,  8.74s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:44<00:00,  8.95s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:45<00:00,  9.10s/it]
[2023-02-24 18:12:41,354] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using cuda_amp half precision backend
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17786
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17787
[2023-02-24 18:13:10,286] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17788
[2023-02-24 18:13:10,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17789
[2023-02-24 18:13:10,620] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17790
[2023-02-24 18:13:10,953] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17791
[2023-02-24 18:13:10,954] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17792
[2023-02-24 18:13:11,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17793
[2023-02-24 18:13:11,901] [ERROR] [launch.py:324:sigkill_handler] ['/home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python', '-u', 'run_seq2seq_deepspeed.py', '--local_rank=7'] exits with return code = -7

2. Screenshots

  • here is the last console output before the processes were killed:
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
  • here are screenshots of the GPU and CPU states before the processes were killed (screenshots not reproduced here)

3. ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/aiops/xxxx/.miniconda3/envs/trm-pt-py39/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/home/aiops/xxxx/.miniconda3/envs/trm-pt-py39/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

4. System Info

  • OS: Ubuntu 18.04
  • GPU: A100 (40G each) x8
  • CPU x60, Mem 500G

5. Solution? Help!

I understand that return code -7 does not mean OOM (an OOM kill would show up as -9). I have searched all over the Internet but could not find any explanation of DeepSpeed exiting with return code -7. So please help: what does return code -7 mean, and how can I fix it so that training runs smoothly?
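
For context (this note is not part of the original report): the DeepSpeed launcher reports exit status the way Python's subprocess module does, so a negative return code is the number of the signal that killed the worker. Return code -7 therefore means the process was killed by signal 7, which is SIGBUS on Linux x86-64 and is a classic symptom of exhausted shared memory (/dev/shm), consistent with the shm-size fix discussed in the comments below. A minimal check in Python:

import signal

# A subprocess killed by a signal gets return code -<signal number>,
# so "exits with return code = -7" means the worker received signal 7.
print(signal.Signals(7).name)  # prints "SIGBUS" on Linux x86-64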

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 3
  • Comments: 24 (6 by maintainers)

Most upvoted comments

Setting shm-size to a large value instead of the default 64 MB when creating the Docker container solved the problem in my case. It appears that multi-GPU training relies on shared memory.
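
As a quick sanity check (an added sketch, not from the original thread), the shared memory actually available inside the container can be inspected from Python before launching training; with Docker's default 64 MB /dev/shm there is far too little for multi-GPU data loading and inter-process communication:

import shutil

# /dev/shm backs the shared memory used by DataLoader workers and
# inter-process communication; Docker's default size is only 64 MiB.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total {total / 2**20:.0f} MiB, free {free / 2**20:.0f} MiB")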

I ran with an AKS cluster YAML that increases the shm size, following this:

https://stackoverflow.com/questions/43373463/how-to-increase-shm-size-of-a-kubernetes-container-shm-size-equivalent-of-doc

or the equivalent docker command: docker run --rm --runtime=nvidia --gpus all --shm-size 3gb imagename

It worked.
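
For completeness, a sketch of the same fix applied programmatically with the Docker Python SDK (docker-py); "imagename" is a placeholder exactly as in the command above, and the GPU options mirror the flags shown there:

import docker

client = docker.from_env()

# Equivalent of: docker run --rm --runtime=nvidia --gpus all --shm-size 3gb imagename
client.containers.run(
    "imagename",    # placeholder image name
    shm_size="3g",  # raise /dev/shm above the 64 MB default
    runtime="nvidia",
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    remove=True,
)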

@Shrishml Try increasing the shm-size of the container as suggested in the comment above.

Did you solve it? I am running into the same problem.

Increasing the shared memory size of the docker container seems to resolve the issue.

Hi @tjruwase, I’m using 500 GB of memory with a batch size of only 1-2, so I don't think this is an OOM problem. I also tested with the smaller FLAN-T5-xl and ended up with the same error.