transformers: Trainer Stuck at 0% Progress during Training on Multi-GPU Setup
System Info
- transformers version: 4.33.3
- Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
- Python version: 3.10.13
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.3.3
- Accelerate version: 0.23.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Machine: 8 x A800 GPUs
Who can help?
@ArthurZucker @younesbelkada @pacman100 @muellerzr
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I am running the following script:
python ContinuePretrainLlama.py \
    --data_path ./PretrainData/kyoto-train.txt \
    --output_dir ./llama2 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --save_strategy no
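For reference, the single-process launch above lets the Trainer fan out over all eight visible GPUs on its own (DataParallel). An explicit distributed launch of the same script would look roughly like the following; this is a sketch, not the exact command behind the logs below:

# Sketch of a DDP launch with torchrun, one process per GPU (assumes all 8 GPUs are used)
torchrun --nproc_per_node=8 ContinuePretrainLlama.py \
    --data_path ./PretrainData/kyoto-train.txt \
    --output_dir ./llama2 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --save_strategy no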
The core parts of the code that may be related to the issue are:
def preprocess(data, tokenizer):
    input_encoding = tokenizer(data["text"], truncation=True, max_length=2048, padding="max_length",
                               return_tensors="np", return_special_tokens_mask=True, return_attention_mask=False)
    # Create the final dictionary
    result = {
        "input_ids": input_encoding["input_ids"],
        "special_tokens_mask": input_encoding["special_tokens_mask"],
    }
    return result


def make_pretrain_data_module(tokenizer: transformers.PreTrainedTokenizer, data_path, model) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    # train_dataset = PretrainedDataset(tokenizer=tokenizer, data_path=data_path)
    dataset = load_dataset('text', data_files=data_path)
    train_dataset = dataset.map(lambda data: preprocess(data, tokenizer), batched=True, desc="Processing",
                                remove_columns=["text"])["train"]
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False,
                                                                 return_tensors="pt")
    return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)
def smart_tokenizer_and_embedding_resize(
        special_tokens_dict: Dict,
        tokenizer: transformers.PreTrainedTokenizer,
        model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data
        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg
def train():
    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    if model_args.tokenizer_path is None:
        model_args.tokenizer_path = model_args.model_name_or_path
    tokenizer = LlamaTokenizer.from_pretrained(model_args.tokenizer_path,
                                               model_max_length=training_args.model_max_length,
                                               padding_side="right",
                                               use_fast=False, cache_dir=training_args.cache_dir)
    special_tokens_dict = dict()
    if tokenizer.pad_token is None:
        special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
    if tokenizer.eos_token is None:
        special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
    if tokenizer.bos_token is None:
        special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
    if tokenizer.unk_token is None:
        special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN
    model = LlamaForCausalLM.from_pretrained(model_args.model_name_or_path).cuda(0)
    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=special_tokens_dict,
        tokenizer=tokenizer,
        model=model,
    )
    if model_args.peft_lora:
        if model_args.lora_config is None:
            raise ValueError("Please specify the path to the PEFT config.")
        lora_config = LoraConfig(**LoraConfig.from_json_file(model_args.lora_config))
        model = get_peft_model(model, lora_config)
        print("You are using lora model!\n")
        model.print_trainable_parameters()
    data_module = make_pretrain_data_module(tokenizer=tokenizer, data_path=data_args.data_path, model=model)
    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
    trainer.train()
    trainer.save_state()
    trainer.save_model(output_dir=training_args.output_dir)


if __name__ == "__main__":
    train()
GPU memory is successfully allocated on every card, but the progress bar has been stuck at 0% for more than 20 hours, and I can't stop the run with Ctrl+C.
Here is the wandb dashboard for the run: https://wandb.ai/innovation_club/huggingface/runs/brca7vz5?workspace=user-
Additional Information: Upon testing, the code runs perfectly on the CPU. However, when I shift to a multi-GPU setup, the training process doesn’t proceed. It’s essential to note that the memory allocation on the GPUs does take place, indicating that the process has initiated, but no forward or backward computations are observed.
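A way to see where the workers are stuck (a sketch; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, and py-spy is a separate tool that has to be installed) is to enable NCCL logging and dump the Python stacks of the hung processes:

# Surface NCCL communicator-initialization logs while launching the same script as above
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT python ContinuePretrainLlama.py \
    --data_path ./PretrainData/kyoto-train.txt \
    --output_dir ./llama2 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --save_strategy no

# In another shell, dump the Python stack of a hung worker (replace <pid> with the process id)
py-spy dump --pid <pid>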
Expected behavior
Would appreciate any insights or suggestions on resolving this. Thank you!
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Reactions: 1
- Comments: 23 (4 by maintainers)
I finally solved this by disabling ACS in the BIOS; see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs. Changing the NVIDIA driver and CUDA versions didn't help.
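For anyone checking the same thing, the linked page also shows how to inspect ACS from the running system before touching the BIOS (a sketch; the setpci workaround is taken from that page and needs the affected bridge's PCI address):

# List PCI bridges with ACS enabled; any enabled ACSCtl entry means P2P traffic may be redirected through the root complex
sudo lspci -vvv | grep -i ACSCtl

# Per the NCCL troubleshooting page, ACS can also be disabled per bridge without a BIOS change:
# sudo setpci -s <bridge_bus:dev.fn> ECAP_ACS+0x6.w=0000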
This test is very helpful: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-gpu-communication
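Concretely, that section boils down to running a P2P/collective benchmark across all the GPUs, for example with nccl-tests (a sketch, assuming CUDA and the repo's build requirements are available):

# Build and run the NCCL all-reduce benchmark on all 8 GPUs;
# a broken P2P path typically shows up as a hang or very low bus bandwidth here
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8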
There seems to be a working solution here: https://github.com/NVIDIA/nccl/issues/1027. You should add NCCL_P2P_DISABLE=1 before your command.
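That is, using the launch line from the original report:

# Disable NCCL peer-to-peer transport for this run only (a workaround, not a root-cause fix)
NCCL_P2P_DISABLE=1 python ContinuePretrainLlama.py \
    --data_path ./PretrainData/kyoto-train.txt \
    --output_dir ./llama2 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --save_strategy no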
I am working on a cluster with SLURM, and it took me forever to resolve this issue; however, when I changed the DDP backend with --ddp_backend gloo, it finally worked!
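For reference, that is just one extra Trainer argument on the launch command (a sketch reusing the arguments from the original report; the torchrun launch is my assumption, since the SLURM setup wasn't shown). gloo avoids NCCL entirely, at the cost of slower inter-GPU communication:

# Same training arguments as the original report, but forcing the Trainer's DDP backend to gloo
torchrun --nproc_per_node=8 ContinuePretrainLlama.py \
    --data_path ./PretrainData/kyoto-train.txt \
    --output_dir ./llama2 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --save_strategy no \
    --ddp_backend gloo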