transformers: Trainer Stuck at 0% Progress during Training on Multi-GPU Setup

System Info

  • transformers version: 4.33.3
  • Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
  • Python version: 3.10.13
  • Huggingface_hub version: 0.17.3
  • Safetensors version: 0.3.3
  • Accelerate version: 0.23.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Machine: 8 x A800 GPUs

Who can help?

@ArthurZucker @younesbelkada @pacman100 @muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I am running the following script:

python ContinuePretrainLlama.py \
--data_path ./PretrainData/kyoto-train.txt \
--output_dir ./llama2  \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--save_strategy no

The core parts of the code that may be related to the issue are:

from typing import Dict

import transformers
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM, LlamaTokenizer, Trainer

# Note: DEFAULT_PAD_TOKEN / DEFAULT_EOS_TOKEN / DEFAULT_BOS_TOKEN / DEFAULT_UNK_TOKEN
# and the ModelArguments / DataArguments / TrainingArguments dataclasses are
# defined elsewhere in the script and omitted here.


def preprocess(data, tokenizer):
    input_encoding = tokenizer(data["text"], truncation=True, max_length=2048, padding="max_length",
                               return_tensors="np", return_special_tokens_mask=True, return_attention_mask=False)

    # Create the final dictionary
    result = {
        "input_ids": input_encoding["input_ids"],
        "special_tokens_mask": input_encoding["special_tokens_mask"],
    }

    return result


def make_pretrain_data_module(tokenizer: transformers.PreTrainedTokenizer, data_path, model) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    # train_dataset = PretrainedDataset(tokenizer=tokenizer, data_path=data_path)
    dataset = load_dataset('text', data_files=data_path)
    train_dataset = dataset.map(
        lambda data: preprocess(data, tokenizer),
        batched=True,
        desc="Processing",
        remove_columns=["text"],
    )["train"]
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                                 mlm=False,
                                                                 return_tensors="pt")
    return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)


def smart_tokenizer_and_embedding_resize(
        special_tokens_dict: Dict,
        tokenizer: transformers.PreTrainedTokenizer,
        model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg


def train():
    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    if model_args.tokenizer_path is None:
        model_args.tokenizer_path = model_args.model_name_or_path

    tokenizer = LlamaTokenizer.from_pretrained(model_args.tokenizer_path,
                                               model_max_length=training_args.model_max_length, padding_side="right",
                                               use_fast=False, cache_dir=training_args.cache_dir)
    special_tokens_dict = dict()
    if tokenizer.pad_token is None:
        special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
    if tokenizer.eos_token is None:
        special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
    if tokenizer.bos_token is None:
        special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
    if tokenizer.unk_token is None:
        special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN

    model = LlamaForCausalLM.from_pretrained(model_args.model_name_or_path).cuda(0)
    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=special_tokens_dict,
        tokenizer=tokenizer,
        model=model,
    )
    if model_args.peft_lora:
        if model_args.lora_config is None:
            raise ValueError("Please specify the path to the PEFT config.")
        lora_config = LoraConfig(**LoraConfig.from_json_file(model_args.lora_config))
        model = get_peft_model(model, lora_config)
        print("You are using lora model!\n")
        model.print_trainable_parameters()

    data_module = make_pretrain_data_module(tokenizer=tokenizer, data_path=data_args.data_path, model=model)
    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
    trainer.train()
    trainer.save_state()
    trainer.save_model(output_dir=training_args.output_dir)


if __name__ == "__main__":
    train()

As we can see, the GPU memory is successfully allocated (see screenshot).

It has been stuck at 0% for more than 20 hours (see screenshot).

And I can't stop it with Ctrl+C (see screenshot).

This is the dashboard link in wandb https://wandb.ai/innovation_club/huggingface/runs/brca7vz5?workspace=user-

Additional information: upon testing, the code runs perfectly on the CPU. However, when I switch to a multi-GPU setup, the training process does not proceed. Note that memory allocation on the GPUs does take place, indicating that the processes have started, but no forward or backward computation is observed.
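
One way to narrow this down (my own sketch, not part of the original script) is to make only a single GPU visible before any CUDA context is created; if training then progresses, the hang is specific to the multi-GPU communication path:

import os

# Must be set before torch (and therefore CUDA) is imported/initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print("Visible GPUs:", torch.cuda.device_count())  # expected: 1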

Expected behavior

Training should progress past 0% on the multi-GPU setup, just as it does on CPU. Would appreciate any insights or suggestions on resolving this. Thank you!

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Reactions: 1
  • Comments: 23 (4 by maintainers)

Most upvoted comments

I encountered the same issue. Has anyone found a solution?

The fix described here seems to work: https://github.com/NVIDIA/nccl/issues/1027. You should add NCCL_P2P_DISABLE=1 before your command.
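
For reference, a minimal sketch of applying this from inside the training script rather than on the command line (the variable just has to be set before any NCCL communicator is created, so putting it at the very top of ContinuePretrainLlama.py is equivalent to prefixing the launch command with NCCL_P2P_DISABLE=1):

import os

# Disable NCCL peer-to-peer transport. Must run before torch.distributed /
# the Trainer initializes NCCL.
os.environ["NCCL_P2P_DISABLE"] = "1"
# Optional: print NCCL setup logs to confirm the setting is picked up.
os.environ.setdefault("NCCL_DEBUG", "INFO")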

I am working on a cluster with SLURM; it took me forever to resolve this issue. However, when I changed the DDP backend with --ddp_backend gloo, it finally worked!
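
For completeness, a sketch of setting the same thing programmatically, assuming the stock transformers TrainingArguments (the script above parses its own TrainingArguments dataclass, so treat this as illustrative only):

from transformers import TrainingArguments

# Gloo avoids NCCL entirely, which sidesteps NCCL P2P hangs at the cost of
# slower inter-GPU communication.
training_args = TrainingArguments(
    output_dir="./llama2",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    save_strategy="no",
    ddp_backend="gloo",  # same effect as passing --ddp_backend gloo
)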