metaseq: `convert_to_singleton` seems to hang for OPT-66B
What is your question?
With the directory prepared as follows:
$ ls 66b/
dict.txt reshard-model_part-0-shard0.pt reshard-model_part-3-shard0.pt reshard-model_part-6-shard0.pt
gpt2-merges.txt reshard-model_part-1-shard0.pt reshard-model_part-4-shard0.pt reshard-model_part-7-shard0.pt
gpt2-vocab.json reshard-model_part-2-shard0.pt reshard-model_part-5-shard0.pt
I had to hack checkpoint_utils.py a bit, since the assumption here does not hold for OPT-66B:
https://github.com/facebookresearch/metaseq/blob/ac8659de23b680005a14490d72a874613ab59381/metaseq/checkpoint_utils.py#L390-L391
I replaced those lines with:
# path to checkpoint...-shared.pt
local_path = local_path.split('.')[0] + '-shard0.pt'
paths_to_load = get_paths_to_load(local_path, suffix="shard")
Running the following
NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO python -m metaseq.scripts.convert_to_singleton 66b/
is taking a long time (22 hours and counting). Initially nvidia-smi showed processes running on all eight GPUs; then the process on GPU 5 terminated first, and it has been in the following state for hours:
$ nvidia-smi
Thu Oct 13 19:24:37 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:16.0 Off | 0 |
| N/A 54C P0 74W / 300W | 20049MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:17.0 Off | 0 |
| N/A 53C P0 72W / 300W | 20133MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:18.0 Off | 0 |
| N/A 52C P0 73W / 300W | 19845MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:19.0 Off | 0 |
| N/A 50C P0 70W / 300W | 19857MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:00:1A.0 Off | 0 |
| N/A 54C P0 76W / 300W | 20073MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:00:1B.0 Off | 0 |
| N/A 47C P0 44W / 300W | 1413MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:00:1C.0 Off | 0 |
| N/A 50C P0 72W / 300W | 19977MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:00:1D.0 Off | 0 |
| N/A 54C P0 69W / 300W | 19905MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1335 C python 19788MiB |
| 1 N/A N/A 1419 C ...onda/envs/user/bin/python 19872MiB |
| 2 N/A N/A 1420 C ...onda/envs/user/bin/python 19584MiB |
| 3 N/A N/A 1421 C ...onda/envs/user/bin/python 19596MiB |
| 4 N/A N/A 1422 C ...onda/envs/user/bin/python 19812MiB |
| 6 N/A N/A 1424 C ...onda/envs/user/bin/python 19716MiB |
| 7 N/A N/A 1425 C ...onda/envs/user/bin/python 19644MiB |
+-----------------------------------------------------------------------------+
Is there something obviously wrong here, or is there something I should try instead? Just in case it really does take this long, I have left it running. The last few logging lines at INFO level look like this:
(...)
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 14 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO Channel 14 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO Channel 14 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 15 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 15 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO comm 0x7f5f78003090 rank 1 nranks 8 cudaDev 1 busId 170 - Init COMPLETE
i-0b2d24dbd20c27dd0:1420:3386 [2] NCCL INFO comm 0x7f7408003090 rank 2 nranks 8 cudaDev 2 busId 180 - Init COMPLETE
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO comm 0x7fdfc8003090 rank 4 nranks 8 cudaDev 4 busId 1a0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO comm 0x7f5b60003090 rank 0 nranks 8 cudaDev 0 busId 160 - Init COMPLETE
i-0b2d24dbd20c27dd0:1424:3384 [6] NCCL INFO comm 0x7fd82c003090 rank 6 nranks 8 cudaDev 6 busId 1c0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO comm 0x7fd544003090 rank 5 nranks 8 cudaDev 5 busId 1b0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1421:3389 [3] NCCL INFO comm 0x7f9c64003090 rank 3 nranks 8 cudaDev 3 busId 190 - Init COMPLETE
i-0b2d24dbd20c27dd0:1425:3385 [7] NCCL INFO comm 0x7f3fe0003090 rank 7 nranks 8 cudaDev 7 busId 1d0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:1335 [0] NCCL INFO Launch mode Parallel
What’s your environment?
- metaseq Version: 7828d72815a9a581ab47b95876d38cb262741883 (Oct 5 main)
- PyTorch Version: 1.12.1+cu113
- OS: Ubuntu 18.04.6 LTS
- How you installed metaseq: pip
- Build command you used (if compiling from source): N.A.
- Python version: 3.10
- CUDA/cuDNN version: CUDA 11.8
- GPU models and configuration: 8 x V100 SXM2 32 GB
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 39 (26 by maintainers)
@punitkoura 517d7ad indeed works 🎉:
I have checked the generated tokens and they look reasonable.
To convert OPT 66B into Huggingface format, you can do the following:
1. Use the reshard_mp script to merge all model-parallel parts into one singleton, using --num-output-parts 1.
2. Use transformers.models.opt.convert_opt_original_pytorch_checkpoint_to_pytorch, provided by Huggingface, to generate a config and convert the singleton into HF format.
You can also download the model weights for OPT-66B directly from HF (see link).
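For a quick sanity check of the converted model (or of the facebook/opt-66b weights on the Hub), something like the following should work. The local path is a placeholder for wherever the conversion script wrote its output, and device_map="auto" assumes accelerate is installed and that there is enough combined GPU/CPU memory for roughly 130 GB of fp16 weights:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "opt-66b-hf"  # placeholder: the conversion output folder, or "facebook/opt-66b"
# The converted folder may not include tokenizer files, so take the tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-66b")
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))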
It’s been a while without any update, so I’m closing the issue now. Please let us know if you need further help.
@tangbinh The branching happens here https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L42 … If a distributed port is specified, we assume a Slurm configuration. If it is unspecified, we go down the if-else tree and correctly infer single-node init: https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L53
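In pseudocode, that branching reads roughly as follows (a paraphrased sketch with a hypothetical config class and simplified bodies, not the actual metaseq code; the real SLURM branch derives the master address from SLURM environment variables, and the single-node branch picks a free port):
from dataclasses import dataclass
from typing import Optional

@dataclass
class DistributedConfig:
    # Hypothetical stand-in for the relevant metaseq distributed options.
    distributed_port: int = -1
    distributed_world_size: int = 8
    distributed_init_method: Optional[str] = None

def infer_init_method(cfg: DistributedConfig) -> None:
    if cfg.distributed_port > 0:
        # A port was supplied (e.g. --distributed-port 13000): assume a SLURM
        # launch and rendezvous at the SLURM master node on that port.
        cfg.distributed_init_method = f"tcp://<slurm-master-node>:{cfg.distributed_port}"
    elif cfg.distributed_world_size > 1:
        # No port supplied: fall through to single-node initialization,
        # rendezvousing over localhost.
        cfg.distributed_init_method = "tcp://localhost:29500"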
Awesome, sorry about the back and forth. I’ll add these instructions somewhere in the README here https://github.com/facebookresearch/metaseq/tree/main/metaseq/cli
Rolling back to 8500e88 and removing "--distributed-port 13000" manually indeed also works 🎉.
@EIFY sorry about that. Let me replicate your steps and add some print statements in a separate branch to figure out the root cause. I’ll update this issue in a bit.
Using #430, convert_to_singleton completed successfully after writing restored.pt 🎉 However, metaseq-api-local failed to load from it, as it tries to put the whole model on GPU 0. I thought metaseq would shard the model automatically here. Are there configs / env variables that can make it do so? I have tried the latest main just in case and it didn't help. @punitkoura
Tl;dr - The current state of convert_to_singleton seems to waste CPU memory by creating multiple copies of the model, which we try to fix with the patch in https://github.com/facebookresearch/metaseq/pull/430
Yes, that is my hypothesis. I observed this when trying to consolidate other larger models as well. I agree, we should detect this condition and terminate instead of hanging.
We won’t be using GPU 0 to store the whole model. We stitch all parameters on CPU, so we won’t need extra GPU memory from what I’ve observed. (See the .cpu() call in convert_to_singleton). As long as you have enough CPU memory you should be fine. But let me know if you still face issues.
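Illustratively, the consolidation keeps the stitched parameters on CPU before saving, along these lines (a sketch, not the actual convert_to_singleton code):
from collections import OrderedDict
import torch

def save_consolidated(state_dict, path="restored.pt"):
    # Move every tensor to CPU before saving, so writing the singleton
    # checkpoint never needs a single GPU large enough for the whole model.
    cpu_state = OrderedDict((k, v.cpu()) for k, v in state_dict.items())
    torch.save(cpu_state, path)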
And sorry about the checkpoint naming confusion… The checkpoints should ideally be named like the other OPT checkpoints, i.e. without the shard0 suffix. Changing this would let you load the checkpoint without having to patch checkpoint_utils.py. I'll work on getting the names fixed in the meantime.
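For illustration, if the expected names are simply the same filenames without the -shard0 suffix (an assumption; the intended naming isn't spelled out here), a rename could look like this:
from pathlib import Path

# Hypothetical rename: drop the "-shard0" suffix so that, e.g.,
# reshard-model_part-0-shard0.pt becomes reshard-model_part-0.pt.
ckpt_dir = Path("66b")
for path in sorted(ckpt_dir.glob("reshard-model_part-*-shard0.pt")):
    path.rename(path.with_name(path.name.replace("-shard0", "")))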
For context, the process you mentioned is probably terminating early because of running out of memory. The patch I tagged in my previous comment does a gather only on the first process instead of consolidating everything on all processes (and thereby creating 8 copies of the model).
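Conceptually, the memory saving looks like this (a minimal sketch of gathering each model-parallel parameter on rank 0 only, not the actual code in the PR, which apparently needs PyTorch >= 1.12):
from typing import Optional

import torch
import torch.distributed as dist

def gather_param_on_rank0(local_shard: torch.Tensor, dim: int = 0) -> Optional[torch.Tensor]:
    # Assumes an initialized process group; with the NCCL backend the shards
    # must be CUDA tensors.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Only rank 0 allocates receive buffers, so only one process ever holds
    # all shards of a parameter at the same time.
    gather_list = (
        [torch.empty_like(local_shard) for _ in range(world_size)]
        if rank == 0
        else None
    )
    dist.gather(local_shard.contiguous(), gather_list, dst=0)
    if rank == 0:
        # Stitch the shards back together on CPU to avoid extra GPU memory.
        return torch.cat([t.cpu() for t in gather_list], dim=dim)
    return None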
@EIFY Ahh sorry for the delay! Could you have a look at this patch which saves memory when trying to consolidate different model parts into a single checkpoint? https://github.com/facebookresearch/metaseq/pull/430
I haven’t merged this since the patch requires PyTorch 1.12 at least.
Punit has a patch I believe.