transformers: Deepspeed Stage3 using trainer and base DONUT model results in RecursionError.
System Info
- Running on AzureML Standard_NC6S_V3 with curated environment: AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu
- `transformers` version: 4.26.0
- Platform: Linux-5.0.0-1036-azure-x86_64-with-glibc2.31
- Python version: 3.9.15
- Huggingface_hub version: 0.12.0
- PyTorch version (GPU?): 1.12.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Through trainer
- Using distributed or parallel set-up in script?: Through deepspeed/trainer
Who can help?
I am using a base DONUT model. The error only happens with DeepSpeed stage 3: @stas00
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I am fine-tuning a DONUT-based model on an Azure Standard_NC6S_V3 (1 x V100 (16GB)) using AzureML. Below is a minimal example to reproduce the recursion error.
```python
# Train script
import transformers
from transformers import (
    DonutProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    VisionEncoderDecoderModel,
)
from PIL import Image
import datasets

base_model = "naver-clova-ix/donut-base"
image_size = {"width": 680, "height": 960}


def main():
    # Main
    training_args = Seq2SeqTrainingArguments(
        output_dir='./output',
        num_train_epochs=1,
        per_device_train_batch_size=2,
        fp16=True,
        deepspeed='deepspeed_config.json',
    )

    model = VisionEncoderDecoderModel.from_pretrained(base_model)
    processor = DonutProcessor.from_pretrained(base_model)

    # Resize image size in model/processor
    processor.image_processor.size = image_size
    model.config.encoder.image_size = tuple(processor.image_processor.size.values())[::-1]
    model.config.hidden_size = model.config.encoder.hidden_size  # Deepspeed needs this fix

    # Generate bogus dataset
    image = Image.new('RGB', (image_size['width'], image_size['height']))
    text = '{"great_key": "great_value"}'
    N = 16
    data = [{'image': image, 'text': text} for _ in range(N)]
    dataset = datasets.Dataset.from_list(data)

    # Tokenize bogus dataset
    def tokenize(example, processor):
        pixel_values = processor(
            example["image"],
            random_padding=True,
            return_tensors="pt",
        ).pixel_values.squeeze()
        input_ids = processor.tokenizer(  # type: ignore
            example["text"],
            add_special_tokens=False,
            max_length=512,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )["input_ids"].squeeze(0)
        labels = input_ids.clone()
        return {
            "pixel_values": pixel_values,
            "labels": labels,
            "target_sequence": example["text"],
        }

    input_dataset = dataset.map(
        lambda x: tokenize(x, processor),
        remove_columns=['image', 'text'],
    )

    # Train
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=input_dataset,
    )
    trainer.remove_callback(transformers.integrations.AzureMLCallback)
    trainer.train()


if __name__ == "__main__":
    main()
```
The `deepspeed_config.json` referenced above:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "train_batch_size": "auto",
    "fp16": {
        "enabled": "auto"
    }
}
```
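As an aside on the `"auto"` values: if I understand the HF DeepSpeed integration correctly (treat the exact formulas as an assumption based on the docs), the Trainer fills the remaining ZeRO-3 `"auto"` entries from `model.config.hidden_size`, which is why the train script copies the encoder's hidden size onto the top-level config. A minimal sketch:

```python
# Minimal sketch (assumed formulas, per the HF DeepSpeed docs): how the ZeRO-3
# "auto" values are believed to be derived from model.config.hidden_size.
# "train_batch_size": "auto" is resolved separately, from
# per-device batch size * gradient accumulation * world size.
hidden_size = 1024  # illustrative value; the script uses model.config.encoder.hidden_size

derived = {
    "reduce_bucket_size": hidden_size * hidden_size,
    "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
    "stage3_param_persistence_threshold": 10 * hidden_size,
}
print(derived)
```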
Probably not relevant, but here is the job submission script as well.
```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import command

compute_name = ""
environment_name = ""

ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(),
    path='/',
)
environment = ml_client.environments.get(environment_name, label="latest")

fail_job = command(
    code='./fail_train',
    command="transformers-cli env && deepspeed --num_gpus 1 failure_train_script.py",
    compute=compute_name,
    environment=environment,
)
job = ml_client.jobs.create_or_update(
    fail_job,
    experiment_name="testing",
)
```
Expected behavior
With DeepSpeed stage 2 everything works, but for large images I get an OOM on the V100 16GB GPU. Therefore, I want to try DeepSpeed stage 3, but this results in the maximum recursion error.
From what I have read, the recursion error is due to DeepSpeed's ZeRO initialisation; however, these bits are somewhat hidden when using the Trainer and I am not sure where to look. I am more than happy to investigate, but I definitely need some guidance (-:
I expect training to start, hopefully with some memory savings, so that I can train a DONUT-based model on a V100 or a smaller GPU.
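To make the "hidden bits" a little more concrete, here is my own rough illustration (not the actual transformers code path) of what I understand to happen under ZeRO-3: model construction gets wrapped in `deepspeed.zero.Init`, and because `VisionEncoderDecoderModel` builds an encoder and a decoder sub-model, the context manager effectively gets entered more than once, which is where the discussion below points the recursion to:

```python
# Rough illustration only (needs a DeepSpeed/distributed environment to actually
# run); the real wrapping happens inside transformers when ZeRO-3 is detected,
# and the Linear layers merely stand in for the encoder/decoder sub-models.
import deepspeed
import torch.nn as nn

with deepspeed.zero.Init():        # outer Init, e.g. around VisionEncoderDecoderModel.from_pretrained
    with deepspeed.zero.Init():    # inner Init, e.g. around the encoder's from_config
        encoder_stub = nn.Linear(16, 16)
    with deepspeed.zero.Init():    # ...and again around the decoder's from_config
        decoder_stub = nn.Linear(16, 16)
```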
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 20 (18 by maintainers)
Thanks for your response. I have posted an issue there. Thanks again.
At the moment, yes.
If you read my bug report https://github.com/microsoft/DeepSpeed/issues/2811, it already asks your exact questions:
And there is a second problem that will emerge once the first one is fixed, see https://github.com/microsoft/DeepSpeed/issues/2812 - I discovered it some months back, and again yesterday when I was hoping to give you a simpler hack, specifically in the diff I shared disabling zero.Init only for `from_config`. I have some hacky ideas to solve it, but not yet an elegant solution.
I will ponder meanwhile how we can fix this on the integration side. This should be totally doable; I just need to find an elegant way of doing it.
Mind you, composed models are a new thing, so a new need calls for a new solution.
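For anyone following along: `deepspeed.zero.Init` accepts an `enabled` flag, so "disabling zero.Init only for `from_config`" boils down to something like the sketch below. This is my own hedged illustration of the mechanism, not the diff referenced above:

```python
# Hedged sketch of the mechanism only: with enabled=False the context manager
# becomes a no-op, so parameters built inside it are not partitioned at
# construction time (the Linear again just stands in for a sub-model).
import deepspeed
import torch.nn as nn

with deepspeed.zero.Init():                   # outer Init around the composed model
    with deepspeed.zero.Init(enabled=False):  # inner Init suppressed for a from_config sub-model
        decoder_stub = nn.Linear(16, 16)
```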
Hi @stas00,
Thanks for the elaborate answers and way of thought.
Let me rephrase what I understood: `deepspeed.zero.Init` should only be called once. This is something I have seen mentioned in other issues in the DeepSpeed repo as well. As we have an encoder + decoder, we practically have two models, which each do a `deepspeed.zero.Init` during the `.from_config` method.
What is unclear to me is who to "blame" (in a positive sense (-😉. If we are only supposed to call `deepspeed.zero.Init` once, something in transformers should be fixed, while if nested `deepspeed.zero.Init` should be allowed (as in your minimal example), DeepSpeed needs a fix. Just thinking out loud.
I will try your suggested hacky fix and will report later.