metaseq: `convert_to_singleton` seems to hang for OPT-66B
What is your question?
With the directory prepared as follows:
$ ls 66b/
dict.txt reshard-model_part-0-shard0.pt reshard-model_part-3-shard0.pt reshard-model_part-6-shard0.pt
gpt2-merges.txt reshard-model_part-1-shard0.pt reshard-model_part-4-shard0.pt reshard-model_part-7-shard0.pt
gpt2-vocab.json reshard-model_part-2-shard0.pt reshard-model_part-5-shard0.pt
I had to hack checkpoint_utils.py a bit, since the assumption here does not hold for OPT-66B:
https://github.com/facebookresearch/metaseq/blob/ac8659de23b680005a14490d72a874613ab59381/metaseq/checkpoint_utils.py#L390-L391
I replaced those lines with:
# path to checkpoint...-shared.pt
local_path = local_path.split('.')[0] + '-shard0.pt'
paths_to_load = get_paths_to_load(local_path, suffix="shard")
Running the following
NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO python -m metaseq.scripts.convert_to_singleton 66b/
is taking a long time (22 hours and counting). Initially nvidia-smi showed processes running on all eight GPUs; then the process on GPU 5 terminated first, and it has been in the following state for hours:
$ nvidia-smi
Thu Oct 13 19:24:37 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:16.0 Off | 0 |
| N/A 54C P0 74W / 300W | 20049MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:17.0 Off | 0 |
| N/A 53C P0 72W / 300W | 20133MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:18.0 Off | 0 |
| N/A 52C P0 73W / 300W | 19845MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:19.0 Off | 0 |
| N/A 50C P0 70W / 300W | 19857MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:00:1A.0 Off | 0 |
| N/A 54C P0 76W / 300W | 20073MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:00:1B.0 Off | 0 |
| N/A 47C P0 44W / 300W | 1413MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:00:1C.0 Off | 0 |
| N/A 50C P0 72W / 300W | 19977MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:00:1D.0 Off | 0 |
| N/A 54C P0 69W / 300W | 19905MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1335 C python 19788MiB |
| 1 N/A N/A 1419 C ...onda/envs/user/bin/python 19872MiB |
| 2 N/A N/A 1420 C ...onda/envs/user/bin/python 19584MiB |
| 3 N/A N/A 1421 C ...onda/envs/user/bin/python 19596MiB |
| 4 N/A N/A 1422 C ...onda/envs/user/bin/python 19812MiB |
| 6 N/A N/A 1424 C ...onda/envs/user/bin/python 19716MiB |
| 7 N/A N/A 1425 C ...onda/envs/user/bin/python 19644MiB |
+-----------------------------------------------------------------------------+
Is there something obviously wrong here, or is there something I should try instead? Just in case it really does take this long, I have left it running. The last few logging lines at INFO level look like this:
(...)
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 14 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO Channel 14 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO Channel 14 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 15 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 15 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO comm 0x7f5f78003090 rank 1 nranks 8 cudaDev 1 busId 170 - Init COMPLETE
i-0b2d24dbd20c27dd0:1420:3386 [2] NCCL INFO comm 0x7f7408003090 rank 2 nranks 8 cudaDev 2 busId 180 - Init COMPLETE
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO comm 0x7fdfc8003090 rank 4 nranks 8 cudaDev 4 busId 1a0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO comm 0x7f5b60003090 rank 0 nranks 8 cudaDev 0 busId 160 - Init COMPLETE
i-0b2d24dbd20c27dd0:1424:3384 [6] NCCL INFO comm 0x7fd82c003090 rank 6 nranks 8 cudaDev 6 busId 1c0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO comm 0x7fd544003090 rank 5 nranks 8 cudaDev 5 busId 1b0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1421:3389 [3] NCCL INFO comm 0x7f9c64003090 rank 3 nranks 8 cudaDev 3 busId 190 - Init COMPLETE
i-0b2d24dbd20c27dd0:1425:3385 [7] NCCL INFO comm 0x7f3fe0003090 rank 7 nranks 8 cudaDev 7 busId 1d0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:1335 [0] NCCL INFO Launch mode Parallel
What’s your environment?
- metaseq Version: 7828d72815a9a581ab47b95876d38cb262741883 (Oct 5 main)
- PyTorch Version: 1.12.1+cu113
- OS: Ubuntu 18.04.6 LTS
- How you installed metaseq: pip
- Build command you used (if compiling from source): N.A.
- Python version: 3.10
- CUDA/cuDNN version: CUDA 11.8
- GPU models and configuration: 8 x V100 SXM2 32 GB
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 39 (26 by maintainers)
@punitkoura 517d7ad indeed works 🎉:
I have checked the generated tokens and they look reasonable.
To convert OPT 66B into Huggingface format, you can do the following:
1. Use the reshard_mp script to merge all model-parallel parts into one singleton, using --num-output-parts 1.
2. Use transformers.models.opt.convert_opt_original_pytorch_checkpoint_to_pytorch, provided by Huggingface, to generate a config and convert the singleton into HF format.
You can also download the model weights for OPT-66B directly from HF (see link).
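For a quick sanity check of the converted model (or of the facebook/opt-66b weights on the Hub), something like the following should work. The local path is a placeholder for wherever the conversion script wrote its output, and device_map="auto" assumes accelerate is installed and that there is enough combined GPU/CPU memory for roughly 130 GB of fp16 weights:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "opt-66b-hf"  # placeholder: the conversion output folder, or "facebook/opt-66b"
# The converted folder may not include tokenizer files, so take the tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-66b")
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))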
It’s been a while without any update, so I’m closing the issue now. Please let us know if you need further help.
@tangbinh The branching happens here https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L42 … If a distributed port is specified, we assume a Slurm configuration. If it is unspecified, we go down the if-else tree and correctly infer single-node init: https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L53
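In pseudocode, that branching reads roughly as follows (a paraphrased sketch with a hypothetical config class and simplified bodies, not the actual metaseq code; the real SLURM branch derives the master address from SLURM environment variables, and the single-node branch picks a free port):
from dataclasses import dataclass
from typing import Optional

@dataclass
class DistributedConfig:
    # Hypothetical stand-in for the relevant metaseq distributed options.
    distributed_port: int = -1
    distributed_world_size: int = 8
    distributed_init_method: Optional[str] = None

def infer_init_method(cfg: DistributedConfig) -> None:
    if cfg.distributed_port > 0:
        # A port was supplied (e.g. --distributed-port 13000): assume a SLURM
        # launch and rendezvous at the SLURM master node on that port.
        cfg.distributed_init_method = f"tcp://<slurm-master-node>:{cfg.distributed_port}"
    elif cfg.distributed_world_size > 1:
        # No port supplied: fall through to single-node initialization,
        # rendezvousing over localhost.
        cfg.distributed_init_method = "tcp://localhost:29500"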
Awesome, sorry about the back and forth. I’ll add these instructions somewhere in the README here https://github.com/facebookresearch/metaseq/tree/main/metaseq/cli
Rolling back to 8500e88 and removing "--distributed-port 13000" manually indeed also works 🎉.
@EIFY sorry about that. Let me replicate your steps and add some print statements in a separate branch to figure out the root cause. I’ll update this issue in a bit.
Using #430, convert_to_singleton completed successfully after writing restored.pt 🎉 However, metaseq-api-local failed to load from it, as it tries to put the whole model on GPU 0. I thought metaseq would shard the model automatically here. Are there configs / env variables that can make it do so? I have tried the latest main just in case and it didn't help. @punitkoura
Tl;dr - The current state of convert_to_singleton seems to waste CPU memory by creating multiple copies of the model, which we try to fix with the patch in https://github.com/facebookresearch/metaseq/pull/430
Yes, that is my hypothesis. I observed this when trying to consolidate other larger models as well. I agree, we should detect this condition and terminate instead of hanging.
We won’t be using GPU 0 to store the whole model. We stitch all parameters on CPU, so we won’t need extra GPU memory from what I’ve observed. (See the .cpu() call in convert_to_singleton). As long as you have enough CPU memory you should be fine. But let me know if you still face issues.
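Illustratively, the consolidation keeps the stitched parameters on CPU before saving, along these lines (a sketch, not the actual convert_to_singleton code):
from collections import OrderedDict
import torch

def save_consolidated(state_dict, path="restored.pt"):
    # Move every tensor to CPU before saving, so writing the singleton
    # checkpoint never needs a single GPU large enough for the whole model.
    cpu_state = OrderedDict((k, v.cpu()) for k, v in state_dict.items())
    torch.save(cpu_state, path)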
And sorry about the checkpoint naming confusion… The checkpoints should ideally be named like the other OPT checkpoints, i.e. without the shard0 suffix. Changing this would let you load the checkpoint without having to patch checkpoint_utils.py. I'll work on getting the names fixed in the meantime.
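For illustration, if the expected names are simply the same filenames without the -shard0 suffix (an assumption; the intended naming isn't spelled out here), a rename could look like this:
from pathlib import Path

# Hypothetical rename: drop the "-shard0" suffix so that, e.g.,
# reshard-model_part-0-shard0.pt becomes reshard-model_part-0.pt.
ckpt_dir = Path("66b")
for path in sorted(ckpt_dir.glob("reshard-model_part-*-shard0.pt")):
    path.rename(path.with_name(path.name.replace("-shard0", "")))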
For context, the process you mentioned is probably terminating early because of running out of memory. The patch I tagged in my previous comment does a gather only on the first process instead of consolidating everything on all processes (and thereby creating 8 copies of the model).
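Conceptually, the memory saving looks like this (a minimal sketch of gathering each model-parallel parameter on rank 0 only, not the actual code in the PR, which apparently needs PyTorch >= 1.12):
from typing import Optional

import torch
import torch.distributed as dist

def gather_param_on_rank0(local_shard: torch.Tensor, dim: int = 0) -> Optional[torch.Tensor]:
    # Assumes an initialized process group; with the NCCL backend the shards
    # must be CUDA tensors.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Only rank 0 allocates receive buffers, so only one process ever holds
    # all shards of a parameter at the same time.
    gather_list = (
        [torch.empty_like(local_shard) for _ in range(world_size)]
        if rank == 0
        else None
    )
    dist.gather(local_shard.contiguous(), gather_list, dst=0)
    if rank == 0:
        # Stitch the shards back together on CPU to avoid extra GPU memory.
        return torch.cat([t.cpu() for t in gather_list], dim=dim)
    return None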
@EIFY Ahh sorry for the delay! Could you have a look at this patch which saves memory when trying to consolidate different model parts into a single checkpoint? https://github.com/facebookresearch/metaseq/pull/430
I haven’t merged this since the patch requires PyTorch 1.12 at least.
Punit has a patch I believe.