llm-foundry: Error: "Watchdog caught collective operation timeout" when finetuning MPT-7B on a local dataset using 2 A100 GPUs

Hi, I am trying to finetune the MPT-7B model on a local dataset using 2 A100 80GB GPUs. Below is the complete log. Torch version: 1.13.1+cu117. I'd appreciate any help resolving this issue.

/mpt-7b/llm-foundry/scripts/train# composer train.py yamls/finetune/mpt-7b_jokes.yaml
Initializing model…
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:153: UserWarning: While attn_impl: triton can be faster than attn_impl: flash it uses more memory. When training larger models this can trigger alloc retries which hurts performance. If encountered, we recommend using attn_impl: flash if your model does not use alibi or prefix_lm.
  warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00, 4.42s/it]
cfg.n_params=6.65e+09
Building train loader…
Using pad_token, but it is not set yet.
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e96cabec8ddb1637.arrow
Building eval loader…
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-76316eeb9f4e44d9.arrow
Building trainer…
Logging config…
max_seq_len: 2048
global_seed: 17
run_name: mpt-7b-finetune
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  config_overrides:
    attn_config:
      attn_impl: triton
      attn_uses_sequence_id: false
tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}
train_loader:
  name: finetuning
  dataset:
    hf_name: local_dataset
    split: train
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
eval_loader:
  name: finetuning
  dataset:
    hf_name: local_dataset
    split: test
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 50ba
  alpha_f: 0
optimizer:
  name: decoupled_adamw
  lr: 5.0e-06
  betas:
  - 0.9
  - 0.999
  eps: 1.0e-08
  weight_decay: 0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 2ep
eval_interval: 1ep
eval_first: true
global_train_batch_size: 8
seed: ${global_seed}
device_eval_batch_size: 4
device_train_microbatch_size: 4
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
save_folder: ./{run_name}/checkpoints
dist_timeout: 600.0
n_gpus: 2
device_train_batch_size: 4
device_train_grad_accum: 1
n_params: 6649286656

Config:
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 2
num_nodes: 1
rank_zero_seed: 17


[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 600963 milliseconds before timing out.
ERROR:composer.cli.launcher:Rank 1 crashed with exit code -6. Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 600963 milliseconds before timing out.
Global rank 0 (PID 786) exited with code -6
Global rank 1 (PID 787) exited with code -6

----------Begin global rank 1 STDOUT----------
Initializing model…
cfg.n_params=6.65e+09
Building train loader…
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Building eval loader…
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Building trainer…

----------End global rank 1 STDOUT----------

----------Begin global rank 1 STDERR----------
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:153: UserWarning: While attn_impl: triton can be faster than attn_impl: flash it uses more memory. When training larger models this can trigger alloc retries which hurts performance. If encountered, we recommend using attn_impl: flash if your model does not use alibi or prefix_lm.
  warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:08<00:08, 8.54s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 4.94s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 5.48s/it]
Using pad_token, but it is not set yet.
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e96cabec8ddb1637.arrow
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-76316eeb9f4e44d9.arrow
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for 601403 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for 601403 milliseconds before timing out.

----------End global rank 1 STDERR----------

ERROR:composer.cli.launcher:Global rank 0 (PID 786) exited with code -6
/mpt-7b/llm-foundry/scripts/train#
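
Note: both ranks are stuck on early collectives, and not even the same one (rank 1 on BROADCAST with SeqNum=1, rank 0 on _ALLGATHER_BASE with SeqNum=2), and the 600000 ms limit matches dist_timeout: 600.0 in the logged config. That pattern is consistent with either an interconnect problem between the two GPUs or the ranks getting out of sync during setup; raising dist_timeout in the YAML would only help if the collective is merely slow rather than stuck. A minimal debugging sketch is below; the environment variables are generic NCCL/PyTorch knobs rather than anything llm-foundry-specific, and the second command is a diagnostic workaround, not a recommended permanent setting.

# Re-run with verbose NCCL and torch.distributed logging to see where the hang occurs:
NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL \
  composer train.py yamls/finetune/mpt-7b_jokes.yaml

# If the NCCL logs point at peer-to-peer transport problems between the two A100s,
# try disabling P2P (and InfiniBand, if present) purely as a diagnostic:
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 \
  composer train.py yamls/finetune/mpt-7b_jokes.yaml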

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 15 (2 by maintainers)

Most upvoted comments

If you haven’t already, can you try working off one of the recommended images in the top-level README, making sure that your code is up-to-date with the main branch, and re-installing to get all the latest dependencies? Basically, I’m wondering if this happens after following the install/set-up instructions in the README.

NCCL errors are notoriously hard to diagnose, so it’d be helpful to see if this is just an environment issue. But, honestly, there’s not a lot to go off of here, so I can’t make any promises.
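
For anyone who wants to try the suggestion above, here is a rough sketch of that reset, assuming the Docker-based flow from the llm-foundry README; the image tag is a placeholder (use whichever image the README currently recommends), and the final command reuses the YAML from this issue.

# Start from a recommended image (tag is a placeholder -- see the llm-foundry README):
docker run --gpus all -it mosaicml/pytorch:<tag-from-README> bash

# Inside the container: fresh checkout on main, then reinstall dependencies.
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry
pip install -e ".[gpu]"

# Copy your finetuning YAML (mpt-7b_jokes.yaml) and local dataset into the checkout, then re-launch:
cd scripts/train
composer train.py yamls/finetune/mpt-7b_jokes.yaml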

Are people still running into this problem?