transformers: Save model checkpoint error during multi-GPU training still happens on 4.36.1

System Info

  • platform: linux
  • python: 3.9
  • transformers: 4.36.1
  • hardware: two A10.2

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

The release notes for 4.36.1 say this error has already been fixed, but it still occurs after I install the latest version and run training on a two-A10 (A10.2) machine.

Traceback (most recent call last):
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2023-12-17 18:09:08 10.0.1.12:     return _run_code(code, main_globals, None,
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 87, in _run_code
2023-12-17 18:09:08 10.0.1.12:     exec(code, run_globals)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 38, in <module>
2023-12-17 18:09:08 10.0.1.12:     fire.Fire(do_cli)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
2023-12-17 18:09:08 10.0.1.12:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
2023-12-17 18:09:08 10.0.1.12:     component, remaining_args = _CallAndUpdateTrace(
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2023-12-17 18:09:08 10.0.1.12:     component = fn(*varargs, **kwargs)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 34, in do_cli
2023-12-17 18:09:08 10.0.1.12:     train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/decompressed_artifact/code/src/axolotl/train.py", line 126, in train
2023-12-17 18:09:08 10.0.1.12:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
2023-12-17 18:09:08 10.0.1.12:     return inner_training_loop(
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
2023-12-17 18:09:08 10.0.1.12:     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2274, in _maybe_log_save_evaluate
2023-12-17 18:09:08 10.0.1.12:     self._save_checkpoint(model, trial, metrics=metrics)
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2376, in _save_checkpoint
2023-12-17 18:09:08 10.0.1.12:     self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
2023-12-17 18:09:08 10.0.1.12:   File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
2023-12-17 18:09:08 10.0.1.12:     with open(json_path, "w", encoding="utf-8") as f:
2023-12-17 18:09:08 10.0.1.12: FileNotFoundError: [Errno 2] No such file or directory: './qlora-out/tmp-checkpoint-1080/trainer_state.json'

Expected behavior

I expect checkpoint saving during multi-GPU training to complete without errors.

About this issue

  • Original URL
  • State: open
  • Created 6 months ago
  • Comments: 26 (8 by maintainers)

Most upvoted comments

And please upgrade to 4.36.2

This problem occurs when training with multiple machines and multiple cards. Perhaps 4.36.2 did not solve it either, since 4.36.1 already tried to check for the presence of staging_output_dir on the main process.

Yes, 4.36.2 also suffers from the same problem, even though #28078 has been updated.

I just found that setting save_on_each_node=False in TrainingArguments works. See #28009
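
For reference, a minimal sketch of that workaround (the output_dir and save_steps values here are placeholders, not from the original report):

    from transformers import TrainingArguments

    # Sketch of the workaround above; output_dir/save_steps are placeholder values.
    args = TrainingArguments(
        output_dir="./qlora-out",
        save_strategy="steps",
        save_steps=1080,
        # With a shared filesystem, only the main process needs to write the
        # checkpoint; keeping this False avoids the staging-folder collision.
        save_on_each_node=False,
    )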

I see the error on 4.36.2 as well, and I have a shared file system across the nodes. I am using 2 nodes with 8 H100 GPUs on each node.
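
A shared filesystem is presumably what makes the race visible: the staging folder created by one process is the same directory every other node's main process sees, so a rename or cleanup on one node can happen while another process is still writing into the old path. A standalone illustration of that failure mode (hypothetical paths, not transformers code):

    import json
    import os

    # Hypothetical paths, mirroring the names in the traceback above.
    staging = "./qlora-out/tmp-checkpoint-1080"
    final = "./qlora-out/checkpoint-1080"

    os.makedirs(staging, exist_ok=True)

    # Process A (the global main process) finishes the checkpoint and renames
    # the staging directory on the shared filesystem.
    os.rename(staging, final)

    # Process B (e.g. the local main process on another node) still holds the
    # old path and tries to write trainer_state.json into it.
    try:
        with open(os.path.join(staging, "trainer_state.json"), "w", encoding="utf-8") as f:
            json.dump({"global_step": 1080}, f)
    except FileNotFoundError as err:
        print("Same failure mode as in the traceback:", err)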

This problem still exists in 4.38.1 with multi-node, multi-GPU training.

@muellerzr This problem seems to be resolved on the latest version of transformers (4.37.2)

In trainer.py, line 2555:

        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)

should be changed to:

        elif self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir, ignore_errors=True)
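
Presumably the idea is that when save_on_each_node is False, only the world main process writes the checkpoint, so the cleanup of leftover tmp-checkpoint-* folders should also be gated on is_world_process_zero(); otherwise, on a shared filesystem, the local main process of another node can delete or race on the staging folder while rank 0 is still renaming it or writing trainer_state.json into it, which is exactly the FileNotFoundError in the traceback above. The ignore_errors=True addition keeps a concurrent rename from turning the cleanup itself into a second crash.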
