NeMo: GPT training process hangs during dataloader sanity checking with TP=2, PP=2 and bias=false

Describe the bug

With TP=2, PP=2 and model.bias=false, the GPT training process hangs during dataloader sanity checking.

Steps/Code to reproduce bug

Check out the r1.15.0 release branch and then run the official GPT pretraining example with my trivial modifications:

git clone https://github.com/NVIDIA/NeMo.git
cd NeMo
git checkout r1.15.0
bash reinstall.sh
cd examples/nlp/language_modeling
bash train.sh

Below is the content of train.sh:

MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
MASTER_PORT=29503

NUM_NODES=1
NODE_RANK=0
GPUS_PER_NODE=8

DATA_PREFIX="[1.0,/data/pretrain/openwebtext2/hfbpe_gpt_training_data_text_document]"

torchrun --nnodes ${NUM_NODES} --nproc_per_node ${GPUS_PER_NODE} \
    --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --node_rank ${NODE_RANK} \
    ./megatron_gpt_pretraining.py  \
    --config-name=megatron_gpt_config.yaml \
    trainer.devices=${GPUS_PER_NODE} \
    trainer.num_nodes=${NUM_NODES} \
    model.data.data_prefix=${DATA_PREFIX} \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=2 \
    +model.bias=false \
    model.bias_activation_fusion=false \
    model.bias_dropout_add_fusion=false

As you can see, the only modifications are the following (a minimal sketch of the Hydra override semantics appears after the list):

  • disable all bias terms of the linear layers in the GPT model
  • disable the bias-related CUDA kernel fusions, since they are not needed once the bias is gone
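
As a minimal sketch (not NeMo code), the Hydra/OmegaConf override semantics used above work as follows: +model.bias=false adds a key that is absent from the YAML config, while the two fusion flags already exist and are simply overridden in place. The config values below are placeholders for illustration, not the real megatron_gpt_config.yaml.

# Illustration of the "+key=value" vs "key=value" Hydra overrides in train.sh.
from omegaconf import OmegaConf, open_dict

cfg = OmegaConf.create(
    {
        "model": {
            "bias_activation_fusion": True,   # assumed present in the YAML config
            "bias_dropout_add_fusion": True,  # assumed present in the YAML config
        }
    }
)
OmegaConf.set_struct(cfg, True)  # Hydra configs are struct: unknown keys raise errors

# model.bias_activation_fusion=false / model.bias_dropout_add_fusion=false:
# the keys already exist, so they are overridden in place.
cfg.model.bias_activation_fusion = False
cfg.model.bias_dropout_add_fusion = False

# +model.bias=false: the key does not exist yet, so it has to be added explicitly.
with open_dict(cfg):
    cfg.model.bias = False

print(OmegaConf.to_yaml(cfg))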

The program hangs during dataloader sanity checking at a very early stage of training.

Training works fine if I turn the bias and the related kernel fusions back on, or if I set PP=1.

Logically, the bias setting should have no bearing on pipeline parallelism, so I consider this unexpected behaviour.
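
To see exactly where each rank is blocked when the hang occurs, one generic debugging option (a suggestion, not something from the issue) is to register a faulthandler signal handler near the top of megatron_gpt_pretraining.py and then send SIGUSR1 to the stuck processes:

import faulthandler
import signal

# Dump the Python stack of every thread in this rank when it receives SIGUSR1
# (e.g. via `kill -USR1 <pid>`); during a hang this shows which collective or
# pipeline stage each rank is waiting in.
faulthandler.register(signal.SIGUSR1, all_threads=True)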

Expected behavior

NeMo is expected to be able to train very large models without any bias terms, such as LLaMA-65B, in pipeline-parallel mode.
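
For reference (an illustration only, not NeMo code), a bias-free linear layer is an ordinary building block; LLaMA-style models use it throughout, and nothing about it is tied to how the model is partitioned across pipeline stages:

import torch

# A linear layer with no bias term: the forward pass is just x @ W.T,
# and layer.bias is simply None.
layer = torch.nn.Linear(4096, 4096, bias=False)
assert layer.bias is None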

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo install: bash reinstall.sh
  • If method of install is [Docker], provide docker pull & docker run commands used:

docker pull nvcr.io/nvidia/nemo:23.01
docker run --shm-size 64g --network=host -v ${HOME}:${HOME} -v ${DATAROOT}:/data -it nvcr.io/nvidia/nemo:23.01 bash

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (4 by maintainers)

Most upvoted comments

This problem still occurs in the latest release (1.19.0). Please address it, at least by applying the patch from @flymark2010.

Hi @ericharper. We tried to run this demo with version r1.17.0 in the corresponding official Docker container (nvcr.io/nvidia/nemo:23.02, without running reinstall this time). Not a single line of code was modified. The training process still hangs during dataloader sanity checking.

Below is the content of run.sh:

NUM_NODES=1
GPUS_PER_NODE=8

DATA_PREFIX="[1.0,/data/pretrain/openwebtext2/hfbpe_gpt_training_data_text_document]"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python ./megatron_gpt_pretraining.py  \
    --config-name=megatron_gpt_config.yaml \
    trainer.devices=${GPUS_PER_NODE} \
    trainer.num_nodes=${NUM_NODES} \
    model.data.data_prefix=${DATA_PREFIX} \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=2 \
    model.bias=false \
    model.bias_activation_fusion=false \
    model.bias_dropout_add_fusion=false