transformers: DeepSpeed stage 3 using Trainer and base DONUT model results in RecursionError.

System Info

  • Running on AzureML Standard_NC6S_V3 with curated environment: AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu
  • transformers version: 4.26.0
  • Platform: Linux-5.0.0-1036-azure-x86_64-with-glibc2.31
  • Python version: 3.9.15
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Through trainer
  • Using distributed or parallel set-up in script?: Through deepspeed/trainer

Who can help?

I am using a base DONUT model. The error only happens with DeepSpeed stage 3: @stas00

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I am fine-tuning a DONUT-based model on an Azure Standard_NC6S_V3 (1 x V100, 16 GB) using AzureML. Below is a minimal example to reproduce the recursion error.

# Train script
import transformers
from transformers import (
    DonutProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    VisionEncoderDecoderModel,
)
from PIL import Image
import datasets


base_model = "naver-clova-ix/donut-base"
image_size = { "width": 680, "height": 960 }


def main():
    # Main
    training_args = Seq2SeqTrainingArguments(
        output_dir='./output',
        num_train_epochs=1,
        per_device_train_batch_size=2,
        fp16=True,
        deepspeed='deepspeed_config.json',
    )

    model = VisionEncoderDecoderModel.from_pretrained(base_model)
    processor = DonutProcessor.from_pretrained(base_model)
    
    # Resize image size in model/processor
    processor.image_processor.size = image_size
    model.config.encoder.image_size = tuple(processor.image_processor.size.values())[::-1]
    model.config.hidden_size = model.config.encoder.hidden_size  # Deepspeed needs this fix
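    # (My assumption of why this is needed: the Trainer's DeepSpeed integration
    # resolves the "auto" bucket sizes in the config from model.config.hidden_size,
    # and the composite VisionEncoderDecoder config has no top-level hidden_size
    # of its own.)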


    # Generate bogus dataset
    image = Image.new('RGB', (image_size['width'], image_size['height']))
    text = '{"great_key": "great_value"}'
    N = 16
    data = [{'image': image, 'text': text} for _ in range(N)]
    dataset = datasets.Dataset.from_list(data)

    # Tokenize bogus dataset
    def tokenize(example, processor):
        pixel_values = processor(
            example["image"],
            random_padding=True,
            return_tensors="pt",
        ).pixel_values.squeeze()

        input_ids = processor.tokenizer(  # type: ignore
            example["text"],
            add_special_tokens=False,
            max_length=512,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )["input_ids"].squeeze(0)

        labels = input_ids.clone()

        return {
            "pixel_values": pixel_values,
            "labels": labels,
            "target_sequence": example["text"],
        }

    input_dataset = dataset.map(
        lambda x: tokenize(x, processor),
        remove_columns=['image', 'text'],
    )

    # Train
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=input_dataset,
    )

    trainer.remove_callback(transformers.integrations.AzureMLCallback)

    trainer.train()

if __name__ == "__main__":
    main()
The DeepSpeed configuration referenced above (deepspeed_config.json):

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": "auto", 
  "fp16": {
        "enabled": "auto"
  }
}

Probably not relevant, but here is the job submission script.

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import command

compute_name = ""
environment_name = ""


ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(),
    path='/', 
)
environment = ml_client.environments.get(environment_name, label="latest")

fail_job = command(
    code='./fail_train',
    command="transformers-cli env && deepspeed --num_gpus 1 failure_train_script.py",
    compute=compute_name,
    environment=environment,
)

job = ml_client.jobs.create_or_update(
    fail_job,
    experiment_name="testing",
)

Expected behavior

When using DeepSpeed stage 2 everything works, but for large images I get OOM on the V100 16 GB GPU. Therefore, I want to try DeepSpeed stage 3, but this results in the maximum recursion error.

From what I have read, the recursion error is due to DeepSpeed's ZeRO initialisation (deepspeed.zero.Init). However, these bits are somewhat hidden when using the Trainer, and I am not sure where to look. I am more than happy to investigate, but I definitely need some guidance (-:
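For reference, this is how I have been checking that ZeRO-3 is actually picked up before the model is loaded (a minimal sketch; my assumption is that from_pretrained only enters deepspeed.zero.Init once this returns True):

from transformers import Seq2SeqTrainingArguments
from transformers.deepspeed import is_deepspeed_zero3_enabled

# Constructing the training arguments parses deepspeed_config.json, which
# (as far as I understand) is what later makes from_pretrained create the
# weights inside a deepspeed.zero.Init context.
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    deepspeed="deepspeed_config.json",
)
print(is_deepspeed_zero3_enabled())  # True -> ZeRO-3 path is active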

I expect training to start and, hopefully, to bring some memory savings so that I can train a DONUT-based model on a V100 or an even smaller GPU.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 20 (18 by maintainers)

Most upvoted comments

Thank you for testing microsoft/DeepSpeed#2989, @dumpmemory - sorry to hear it didn’t resolve the leak. Perhaps file a new issue in DS; for the one I posted I couldn’t provide a repro script, as it was part of a complex system, but perhaps you could. That should help a lot with solving it.

Thanks for your response. I have posted an issue there. Thanks again.

deepspeed.zero.Init should only be called once

at the moment, yes

What is unclear to me is who to “blame” (in a positive sense (-;). …

If you read my bug report https://github.com/microsoft/DeepSpeed/issues/2811 it already asks your exact questions:

And there is a 2nd problem that will emerge if the first one is fixed, see https://github.com/microsoft/DeepSpeed/issues/2812 - I discovered it some months back, and ran into it again yesterday when I was hoping to give you a simpler hack - specifically, in the diff I shared, disabling zero.Init only for from_config. I have some hacky ideas to solve it, but not yet an elegant solution.

I will ponder meanwhile how we can fix this on the integration side. This should be totally doable, just need to find an elegant way of doing that.

Mind you, composed models are a new thing, so a new need calls for a new solution.

Hi @stas00,

Thanks for the elaborate answers and way of thought.

Let me rephrase what I understood: deepspeed.zero.Init should only be called once. This is something I have seen mentioned in other issues in the DeepSpeed repo as well. Since we have an encoder + decoder, we practically have two models, each of which does a deepspeed.zero.Init during the from_config method.
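To make sure we mean the same thing, here is a minimal sketch of the nesting pattern I have in mind (this is my assumption of what effectively happens under the hood, not the actual transformers code, and it needs to be launched with the deepspeed launcher to run):

import torch
import deepspeed

# My assumption: with ZeRO-3 enabled, the composite model and each sub-model
# construction enter their own deepspeed.zero.Init context, so for an
# encoder-decoder model the contexts end up nested.
with deepspeed.zero.Init():              # entered for the composite model
    encoder = torch.nn.Linear(8, 8)      # stand-in for the Swin encoder
    with deepspeed.zero.Init():          # entered again for a sub-model built via from_config
        decoder = torch.nn.Linear(8, 8)  # stand-in for the BART decoder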

What is unclear to me is who to “blame” (in a positive sense (-;). If we are only supposed to call deepspeed.zero.Init once, something in transformers should be fixed, while if nested deepspeed.zero.Init should be allowed (as in your minimal example), DeepSpeed needs a fix.

Just thinking out loud.

I will try your suggested hacky fix and will report later.