transformers: Save model checkpoint error during multi-GPU training still happens on 4.36.1
System Info
- Platform: Linux
- Python: 3.9
- transformers: 4.36.1
- Hardware: two A10 GPUs (A10.2 machine)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The release notes for 4.36.1 say this error has already been fixed; however, it still occurs after installing the latest version when running on a machine with two A10 GPUs (A10.2).
Traceback (most recent call last):
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2023-12-17 18:09:08 10.0.1.12: return _run_code(code, main_globals, None,
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/runpy.py", line 87, in _run_code
2023-12-17 18:09:08 10.0.1.12: exec(code, run_globals)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 38, in <module>
2023-12-17 18:09:08 10.0.1.12: fire.Fire(do_cli)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
2023-12-17 18:09:08 10.0.1.12: component_trace = _Fire(component, args, parsed_flag_args, context, name)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
2023-12-17 18:09:08 10.0.1.12: component, remaining_args = _CallAndUpdateTrace(
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2023-12-17 18:09:08 10.0.1.12: component = fn(*varargs, **kwargs)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/decompressed_artifact/code/src/axolotl/cli/train.py", line 34, in do_cli
2023-12-17 18:09:08 10.0.1.12: train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/decompressed_artifact/code/src/axolotl/train.py", line 126, in train
2023-12-17 18:09:08 10.0.1.12: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
2023-12-17 18:09:08 10.0.1.12: return inner_training_loop(
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
2023-12-17 18:09:08 10.0.1.12: self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2274, in _maybe_log_save_evaluate
2023-12-17 18:09:08 10.0.1.12: self._save_checkpoint(model, trial, metrics=metrics)
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer.py", line 2376, in _save_checkpoint
2023-12-17 18:09:08 10.0.1.12: self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
2023-12-17 18:09:08 10.0.1.12: File "/home/datascience/conda/pytorch2_0forgpuonpython3_9_vziqun/lib/python3.9/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
2023-12-17 18:09:08 10.0.1.12: with open(json_path, "w", encoding="utf-8") as f:
2023-12-17 18:09:08 10.0.1.12: FileNotFoundError: [Errno 2] No such file or directory: './qlora-out/tmp-checkpoint-1080/trainer_state.json'
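For context on the traceback above: trainer_state.json is written into the temporary tmp-checkpoint-1080 staging directory, and in a multi-node run on a shared filesystem the error is consistent with one process renaming that staging directory to its final name while another process is still writing into it. The standalone Python sketch below is a single-process simulation of that race (not taken from the report) and raises the same FileNotFoundError:

```python
import json
import os

# Single-process simulation of the suspected multi-node race (illustrative only).
staging_dir = "./qlora-out/tmp-checkpoint-1080"
final_dir = "./qlora-out/checkpoint-1080"

os.makedirs(staging_dir, exist_ok=True)

# Step 1: the main process on one node renames the staging directory to its final name.
os.rename(staging_dir, final_dir)

# Step 2: a process on another node, still pointing at the old staging path,
# tries to write trainer_state.json there and fails exactly like the traceback above.
with open(os.path.join(staging_dir, "trainer_state.json"), "w", encoding="utf-8") as f:
    json.dump({"global_step": 1080}, f)
```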
Expected behavior
Checkpoint saving should succeed during multi-GPU training instead of failing with FileNotFoundError.
About this issue
- Original URL
- State: open
- Created 6 months ago
- Comments: 26 (8 by maintainers)
Yes, 4.36.2 also suffers from the same problem, even though #28078 has been updated.
I just found that setting save_on_each_node=False in TrainingArguments works. See #28009

I see the error on version 4.36.2 as well, and I have a shared file system across the nodes. Using 2 nodes with 8 H100 GPUs on each node.
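For reference, a minimal sketch of the save_on_each_node workaround mentioned above; the output_dir and save_steps values are illustrative and not taken from the original run:

```python
from transformers import TrainingArguments

# Workaround reported in this thread (see #28009): with a shared filesystem
# across nodes, let only the global main process write and rename checkpoint
# directories instead of every node doing so.
training_args = TrainingArguments(
    output_dir="./qlora-out",     # illustrative path
    save_strategy="steps",
    save_steps=1080,              # illustrative value
    save_on_each_node=False,
)
```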
This problem still exists on 4.38.1 with multi-node, multi-GPU training.
@muellerzr This problem seems to be resolved on the latest version of transformers (4.37.2).

This problem occurs when training with multiple machines and multiple cards. Perhaps 4.36.2 did not solve it either, as 4.36.1 had already attempted to check for the presence of "staging_output_dir" on the main process.
The code at line 2555 of trainer.py should be changed.
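As an illustration only (not the patch proposed in the comment above), the kind of guard being discussed would rename the temporary checkpoint directory once, on the main process, after all ranks have finished writing, and only if the staging directory still exists; the function and parameter names here are hypothetical:

```python
import os

def finalize_checkpoint(staging_output_dir: str, output_dir: str,
                        is_main_process: bool, barrier=None) -> None:
    """Illustrative sketch, not the actual transformers code or patch.

    All ranks first finish writing into `staging_output_dir`; only then does the
    main process rename it to the final checkpoint directory.
    """
    if barrier is not None:
        barrier()  # e.g. torch.distributed.barrier(), so no rank is still writing
    if (
        is_main_process
        and staging_output_dir != output_dir
        and os.path.exists(staging_output_dir)
    ):
        os.rename(staging_output_dir, output_dir)
```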
And please upgrade to 4.36.2