NeMo: GPT training process hangs during dataloader sanity checking with TP=2, PP=2 and bias=false

Describe the bug

With TP=2, PP=2 and model.bias=false, the GPT training process hangs during dataloader sanity checking.

Steps/Code to reproduce bug

Check out the r1.15.0 release branch and then run the official GPT pretraining example with my trivial modifications:

git clone https://github.com/NVIDIA/NeMo.git
cd NeMo
git checkout r1.15.0
bash reinstall.sh
cd examples/nlp/language_modeling
bash train.sh

Below is the content of train.sh:

MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
MASTER_PORT=29503

NUM_NODES=1
NODE_RANK=0
GPUS_PER_NODE=8

DATA_PREFIX="[1.0,/data/pretrain/openwebtext2/hfbpe_gpt_training_data_text_document]"

torchrun --nnodes ${NUM_NODES} --nproc_per_node ${GPUS_PER_NODE} \
    --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --node_rank ${NODE_RANK} \
    ./megatron_gpt_pretraining.py  \
    --config-name=megatron_gpt_config.yaml \
    trainer.devices=${GPUS_PER_NODE} \
    trainer.num_nodes=${NUM_NODES} \
    model.data.data_prefix=${DATA_PREFIX} \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=2 \
    +model.bias=false \
    model.bias_activation_fusion=false \
    model.bias_dropout_add_fusion=false

As you can see, the only modifications are the following (a minimal sketch of the Hydra override semantics appears after the list):

  • disable all bias terms of the linear layers in the GPT model
  • disable the bias-related CUDA kernel fusions, since they are not needed once the bias is gone
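
As a minimal sketch (not NeMo code), the Hydra/OmegaConf override semantics used above work as follows: +model.bias=false adds a key that is absent from the YAML config, while the two fusion flags already exist and are simply overridden in place. The config values below are placeholders for illustration, not the real megatron_gpt_config.yaml.

# Illustration of the "+key=value" vs "key=value" Hydra overrides in train.sh.
from omegaconf import OmegaConf, open_dict

cfg = OmegaConf.create(
    {
        "model": {
            "bias_activation_fusion": True,   # assumed present in the YAML config
            "bias_dropout_add_fusion": True,  # assumed present in the YAML config
        }
    }
)
OmegaConf.set_struct(cfg, True)  # Hydra configs are struct: unknown keys raise errors

# model.bias_activation_fusion=false / model.bias_dropout_add_fusion=false:
# the keys already exist, so they are overridden in place.
cfg.model.bias_activation_fusion = False
cfg.model.bias_dropout_add_fusion = False

# +model.bias=false: the key does not exist yet, so it has to be added explicitly.
with open_dict(cfg):
    cfg.model.bias = False

print(OmegaConf.to_yaml(cfg))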

The program hangs during dataloader sanity checking at a very early stage of training.

Training works fine if I turn the bias and the related kernel fusions back on, or if I set PP=1.

Logically, the bias setting should have no bearing on pipeline parallelism, so I consider this unexpected behaviour.
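
To see exactly where each rank is blocked when the hang occurs, one generic debugging option (a suggestion, not something from the issue) is to register a faulthandler signal handler near the top of megatron_gpt_pretraining.py and then send SIGUSR1 to the stuck processes:

import faulthandler
import signal

# Dump the Python stack of every thread in this rank when it receives SIGUSR1
# (e.g. via `kill -USR1 <pid>`); during a hang this shows which collective or
# pipeline stage each rank is waiting in.
faulthandler.register(signal.SIGUSR1, all_threads=True)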

Expected behavior

NeMo is expected to be able to train very large models without any bias terms, such as LLaMA-65B, in pipeline-parallel mode.
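
For reference (an illustration only, not NeMo code), a bias-free linear layer is an ordinary building block; LLaMA-style models use it throughout, and nothing about it is tied to how the model is partitioned across pipeline stages:

import torch

# A linear layer with no bias term: the forward pass is just x @ W.T,
# and layer.bias is simply None.
layer = torch.nn.Linear(4096, 4096, bias=False)
assert layer.bias is None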

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo install: bash reinstall.sh
  • If method of install is [Docker], provide docker pull & docker run commands used:

docker pull nvcr.io/nvidia/nemo:23.01
docker run --shm-size 64g --network=host -v ${HOME}:${HOME} -v ${DATAROOT}:/data -it nvcr.io/nvidia/nemo:23.01 bash

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (4 by maintainers)

Most upvoted comments

This problem still occurs in the latest release (1.19.0). Please address it, at least by applying the patch from @flymark2010.

Hi @ericharper. We tried to run this demo with version r1.17.0 in the corresponding official Docker container (nvcr.io/nvidia/nemo:23.02, without running reinstall this time). Not a single line of code was modified. The training process still hangs during dataloader sanity checking.

Below is the content of run.sh:

NUM_NODES=1
GPUS_PER_NODE=8

DATA_PREFIX="[1.0,/data/pretrain/openwebtext2/hfbpe_gpt_training_data_text_document]"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python ./megatron_gpt_pretraining.py  \
    --config-name=megatron_gpt_config.yaml \
    trainer.devices=${GPUS_PER_NODE} \
    trainer.num_nodes=${NUM_NODES} \
    model.data.data_prefix=${DATA_PREFIX} \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=2 \
    model.bias=false \
    model.bias_activation_fusion=false \
    model.bias_dropout_add_fusion=false