DeepSpeed-MII: Return code -9 for OPT with 8x40GB A100 GPUs

Hello,

I’m running the following code snippet in opt.py.

import mii
mii_configs = {"tensor_parallel": 8, "dtype": "fp16", "load_with_sys_mem": True}
mii.deploy(task="text-generation", model="facebook/opt-66b", deployment_name="opt", mii_config=mii_configs)
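
For completeness, this is how I query the deployment once the server is live (never reached here, since deploy doesn't return), roughly following the MII examples:

import mii

# Query the "opt" deployment created above (assumes the server started successfully).
generator = mii.mii_query_handle("opt")
result = generator.query({"query": ["DeepSpeed is"]}, do_sample=True, max_new_tokens=30)
print(result)
mii.terminate("opt")  # tear the deployment down when done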

However, the deployment never comes up: after roughly 20 minutes of waiting for the server to start, the launcher kills the subprocesses and the worker command exits with return code -9:

❯ python opt.py
[2023-06-01 00:07:01,072] [INFO] [deployment.py:87:deploy] ************* MII is using DeepSpeed Optimizations to accelerate your model *************
[2023-06-01 00:07:01,147] [INFO] [server_client.py:219:_initialize_service] MII using multi-gpu deepspeed launcher:
 ------------------------------------------------------------
 task-name .................... text-generation
 model ........................ facebook/opt-66b
 model-path ................... /tmp/mii_models
 port ......................... 50050
 provider ..................... hugging-face
 ------------------------------------------------------------
[2023-06-01 00:07:02,641] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-01 00:07:02,711] [INFO] [runner.py:541:main] cmd = /home/azureuser/miniconda3/envs/gptneox/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --no_python --no_local_rank --enable_each_rank_log=None /home/azureuser/miniconda3/envs/gptneox/bin/python -m mii.launch.multi_gpu_server --task-name text-generation --model facebook/opt-66b --model-path /tmp/mii_models --port 50050 --ds-optimize --provider hugging-face --config eyJ0ZW5zb3JfcGFyYWxsZWwiOiA4LCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogInRvcmNoLmZsb2F0MTYiLCAibG9hZF93aXRoX3N5c19tZW0iOiB0cnVlLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGwsICJkZXBsb3lfcmFuayI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiaGZfYXV0aF90b2tlbiI6IG51bGwsICJyZXBsYWNlX3dpdGhfa2VybmVsX2luamVjdCI6IHRydWUsICJwcm9maWxlX21vZGVsX3RpbWUiOiBmYWxzZSwgInNraXBfbW9kZWxfY2hlY2siOiBmYWxzZX0=
[2023-06-01 00:07:04,194] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-06-01 00:07:04,194] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-06-01 00:07:04,194] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-06-01 00:07:04,194] [INFO] [launch.py:247:main] dist_world_size=8
[2023-06-01 00:07:04,194] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-06-01 00:07:06,163] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[edited out spam]
[2023-06-01 00:25:32,433] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2023-06-01 00:28:11,415] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63867
[2023-06-01 00:28:13,382] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63868
[2023-06-01 00:28:14,431] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2023-06-01 00:28:15,556] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63869
[2023-06-01 00:28:17,729] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63870
[2023-06-01 00:28:19,436] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2023-06-01 00:28:19,823] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63871
[2023-06-01 00:28:21,743] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63872
[2023-06-01 00:28:23,145] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63873
[2023-06-01 00:28:24,440] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2023-06-01 00:28:24,467] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63874
[2023-06-01 00:28:24,468] [ERROR] [launch.py:434:sigkill_handler] ['/home/azureuser/miniconda3/envs/gptneox/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-generation', '--model', 'facebook/opt-66b', '--model-path', '/tmp/mii_models', '--port', '50050', '--ds-optimize', '--provider', 'hugging-face', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiA4LCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogInRvcmNoLmZsb2F0MTYiLCAibG9hZF93aXRoX3N5c19tZW0iOiB0cnVlLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGwsICJkZXBsb3lfcmFuayI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiaGZfYXV0aF90b2tlbiI6IG51bGwsICJyZXBsYWNlX3dpdGhfa2VybmVsX2luamVjdCI6IHRydWUsICJwcm9maWxlX21vZGVsX3RpbWUiOiBmYWxzZSwgInNraXBfbW9kZWxfY2hlY2siOiBmYWxzZX0='] exits with return code = -9
Traceback (most recent call last):
  File "opt.py", line 3, in <module>
    mii.deploy(task="text-generation", model="facebook/opt-66b", deployment_name="bloom", mii_config=mii_configs)
  File "/home/azureuser/miniconda3/envs/gptneox/lib/python3.8/site-packages/mii/deployment.py", line 114, in deploy
    return _deploy_local(deployment_name, model_path=model_path)
  File "/home/azureuser/miniconda3/envs/gptneox/lib/python3.8/site-packages/mii/deployment.py", line 120, in _deploy_local
    mii.utils.import_score_file(deployment_name).init()
  File "/tmp/mii_cache/bloom/score.py", line 30, in init
    model = mii.MIIServerClient(task,
  File "/home/azureuser/miniconda3/envs/gptneox/lib/python3.8/site-packages/mii/server_client.py", line 92, in __init__
    self._wait_until_server_is_live()
  File "/home/azureuser/miniconda3/envs/gptneox/lib/python3.8/site-packages/mii/server_client.py", line 115, in _wait_until_server_is_live
    raise RuntimeError("server crashed for some reason, unable to proceed")
RuntimeError: server crashed for some reason, unable to proceed

Through monitoring, I've found that this likely happens because the host machine runs out of memory (though that might be a symptom rather than the root cause). However, the host has 885 GB of RAM, so I'm not sure why loading OPT-66B uses so much memory; it should consume far less. I also run into the same issue with the int8 version of bloom-175B.
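
For reference, a rough back-of-envelope estimate, assuming (unverified) that with load_with_sys_mem each of the 8 tensor-parallel ranks first loads a full fp16 copy of the checkpoint into host RAM before sharding:

# Rough peak-host-RAM estimate, under the assumption that every rank
# materializes a full fp16 copy of OPT-66B in system memory at the same time.
params = 66e9            # approximate parameter count of OPT-66B
bytes_per_param = 2      # fp16
ranks = 8                # tensor_parallel=8, one process per GPU

per_rank_gb = params * bytes_per_param / 1e9   # ~132 GB per full copy
total_gb = per_rank_gb * ranks                 # ~1056 GB if all ranks load concurrently
print(f"per rank ~{per_rank_gb:.0f} GB, all ranks ~{total_gb:.0f} GB (host has 885 GB)")

If that assumption holds, the -9 return code (a SIGKILL, typically from the kernel OOM killer) would make sense even with 885 GB, though I'd still expect the checkpoint to be loadable more frugally than that.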

Could someone please help me resolve this?

I am on the following library versions:

transformers: '4.30.0.dev0'
deepspeed: 0.9.2
mii: 0.0.4

Thanks!

Most upvoted comments

I can try this out tomorrow. Would loading a smaller model help? I expect it would fit on a single GPU.
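
For example, I'd reuse the snippet from the issue with a smaller checkpoint (facebook/opt-1.3b here is just a placeholder for any OPT model that fits on one GPU):

import mii

# Same deployment as in the issue, but with a model small enough for a single GPU.
mii_configs = {"tensor_parallel": 1, "dtype": "fp16", "load_with_sys_mem": True}
mii.deploy(task="text-generation",
           model="facebook/opt-1.3b",      # example small OPT checkpoint
           deployment_name="opt-small",
           mii_config=mii_configs)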

I’ve been testing with smaller models since my local system doesn’t have enough memory for the 66B model. I was curious if the error you shared was only happening with the larger model. It looks like it is, so I’ll spin up a larger instance and debug the 66B issue. Thank you for verifying!