NeMo: GPT training process hangs during dataloader sanity check with TP=2, PP=2, and bias=false
Describe the bug
With TP=2, PP=2, and model.bias=false, the GPT training process hangs during the dataloader sanity check.
Steps/Code to reproduce bug
Clone the r1.15.0 release branch, then run the official GPT pretraining example with my trivial modification:
git clone https://github.com/NVIDIA/NeMo.git
cd NeMo
git checkout r1.15.0
bash reinstall.sh
cd examples/nlp/language_modeling
bash train.sh
Below is the content of train.sh:
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
MASTER_PORT=29503
NUM_NODES=1
NODE_RANK=0
GPUS_PER_NODE=8
DATA_PREFIX="[1.0,/data/pretrain/openwebtext2/hfbpe_gpt_training_data_text_document]"
torchrun --nnodes ${NUM_NODES} --nproc_per_node ${GPUS_PER_NODE} \
--master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --node_rank ${NODE_RANK} \
./megatron_gpt_pretraining.py \
--config-name=megatron_gpt_config.yaml \
trainer.devices=${GPUS_PER_NODE} \
trainer.num_nodes=${NUM_NODES} \
model.data.data_prefix="${DATA_PREFIX}" \
model.tensor_model_parallel_size=2 \
model.pipeline_model_parallel_size=2 \
+model.bias=false \
model.bias_activation_fusion=false \
model.bias_dropout_add_fusion=false
As shown above, the only modifications are:
- disable all bias terms of the linear layers in the GPT model
- disable the bias-related CUDA kernel fusions, since there is no bias to fuse
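As an aside (not part of the original report), one way to confirm that these overrides, including the appended +model.bias=false, actually land in the resolved config is to print the composed config without launching training. This assumes NeMo's hydra_runner passes standard Hydra flags such as --cfg through:
# Print the composed job config and exit (no GPUs or data needed).
python ./megatron_gpt_pretraining.py \
  --config-name=megatron_gpt_config.yaml \
  --cfg job \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=2 \
  +model.bias=false \
  model.bias_activation_fusion=false \
  model.bias_dropout_add_fusion=false | grep -E 'bias|parallel_size'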
The program hangs during the dataloader sanity check, at a very early stage of training.
Training runs fine if I turn the bias and the related kernel fusions back on, or if I set PP=1.
Logically, the bias term should have no bearing on pipeline parallelism, so I consider this unexpected behaviour.
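For anyone trying to narrow this down, below is a rough debugging sketch (not from the original report) for finding out where each rank is stuck. It assumes py-spy is installed separately (pip install py-spy); the environment variables are standard NCCL/PyTorch debug switches:
# Make the distributed collectives more verbose, then dump the stack of each hung rank.
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
bash train.sh &
sleep 300   # give the job time to reach the sanity check and hang
for pid in $(pgrep -f megatron_gpt_pretraining.py); do
    echo "=== process ${pid} ==="
    py-spy dump --pid "${pid}"
done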
Expected behavior
NeMo is expected to be able to train very large models that have no bias terms, such as LLaMA-65B, in pipeline-parallel mode.
Environment overview (please complete the following information)
- Environment location: Docker
- Method of NeMo install:
bash reinstall.sh
- If method of install is [Docker], provide docker pull & docker run commands used:
docker pull nvcr.io/nvidia/nemo:23.01
docker run --shm-size 64g --network=host -v ${HOME}:${HOME} -v ${DATAROOT}:/data -it nvcr.io/nvidia/nemo:23.01 bash
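Since the thread below compares several NeMo releases and containers, here is a quick way (an assumption, relying on the image's default Python environment) to check which NeMo build a given container actually ships:
# Print the NeMo version bundled in the container without starting a shell.
docker run --rm nvcr.io/nvidia/nemo:23.01 \
    python -c "import nemo; print(nemo.__version__)"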
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (4 by maintainers)
This problem still occurs in the latest release (1.19.0). Please address it, at least by applying the patch from @flymark2010.
Hi @ericharper. We tried running this demo with version r1.17.0 in the corresponding official Docker container (nvcr.io/nvidia/nemo:23.02, without reinstalling this time). Not a single line of code was modified. The training process still hangs during the dataloader sanity check.
Below is the run.sh: