accelerate: accelerator.end_training() raises an exception when wandb is used as the tracker
System Info
- `Accelerate` version: 0.15.0
- Platform: macOS-13.1-arm64-i386-64bit
- Python version: 3.9.15
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1 (False)
- `Accelerate` default config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: MPS
  - mixed_precision: bf16
  - use_cpu: False
  - dynamo_backend: NO
  - num_processes: 1
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: None
  - main_process_ip: None
  - main_process_port: None
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - deepspeed_config: {}
  - fsdp_config: {}
  - megatron_lm_config: {}
  - downcast_bf16: no
  - tpu_name: None
  - tpu_zone: None
  - command_file: None
  - commands: None
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
I initialize my accelerator trackers like this:

    if args.with_tracking:
        experiment_config = vars(args)
        experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"]
        wandb.login(key=os.environ.get("WANDB_API_KEY"))
        accelerator.init_trackers(
            project_name=os.environ.get('WANDB_PROJECT_NAME'),
            config=experiment_config,
            init_kwargs={
                "wandb": {
                    "job_type": "train",
                    "entity": os.environ.get('WANDB_ENTITY_NAME'),
                    "name": get_training_job_name()
                }
            }
        )

and finish the experiment like this:

    if args.with_tracking:
        accelerator.end_training()

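The setup above reads three settings with `os.environ.get`, which silently returns `None` when a variable is unset. A minimal sketch of checking them up front, assuming a hypothetical helper named `read_wandb_settings` (it is not part of Accelerate or wandb, and nothing in this report suggests a missing variable caused the exception):

```python
import os

# The three variables read in the snippet above.
REQUIRED_VARS = ("WANDB_API_KEY", "WANDB_PROJECT_NAME", "WANDB_ENTITY_NAME")


def read_wandb_settings(env=None):
    """Hypothetical helper: return the W&B settings or fail with a clear message."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `read_wandb_settings()` before `wandb.login` would surface a missing variable as one explicit error instead of passing `None` further down.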
Training runs to completion and the wandb run also finishes, but at the very end the following exception is thrown:

    Exception in thread SockSrvRdThr:
    Traceback (most recent call last):
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/threading.py", line 980, in _bootstrap_inner
        self.run()
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/server_sock.py", line 112, in run
        shandler(sreq)
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/server_sock.py", line 173, in server_record_publish
        iface = self._mux.get_stream(stream_id).interface
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/streams.py", line 199, in get_stream
        stream = self._streams[stream_id]
    KeyError: '3lxi4eq2'

Here the key `3lxi4eq2` is the wandb run_id.
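The header `Exception in thread SockSrvRdThr` indicates that the error is raised in a background thread rather than in the main thread, which would be consistent with training itself completing normally. A minimal stdlib-only sketch of that behaviour, using a toy dictionary lookup in place of wandb's code (`lookup_stream`, `record_thread_exception` and `ToyReaderThread` are hypothetical names, not wandb or Accelerate APIs, and the empty dictionary is an assumption made for illustration):

```python
import threading

captured = []  # exceptions raised inside worker threads are recorded here


def record_thread_exception(args):
    # threading.excepthook (Python 3.8+) receives the uncaught exception of a thread.
    captured.append((args.thread.name, args.exc_type.__name__, args.exc_value.args[0]))


def lookup_stream(streams, stream_id):
    # Toy stand-in for a dictionary lookup; raises KeyError if the id is absent.
    return streams[stream_id]


def run_demo():
    previous_hook = threading.excepthook
    threading.excepthook = record_thread_exception
    main_thread_saw_error = False
    try:
        worker = threading.Thread(
            target=lookup_stream, args=({}, "3lxi4eq2"), name="ToyReaderThread"
        )
        try:
            worker.start()
            worker.join()
        except KeyError:
            # Never reached: the KeyError stays inside the worker thread.
            main_thread_saw_error = True
    finally:
        threading.excepthook = previous_hook
    return main_thread_saw_error


main_thread_saw_error = run_demo()
```

If this reading applies here, wrapping `accelerator.end_training()` in `try`/`except` would not suppress the message, because the exception never reaches the calling thread.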
Expected behavior
Exception should not be thrown at `accelerator.end_training()`
About this issue
- State: open
- Created 2 years ago
- Reactions: 5
- Comments: 15 (1 by maintainers)
We’ve reached out to the W&B folks, we should have a solution soon!
Hi,
I hit the same problem when running the `run_glue_no_trainer.py` script.
Here is my script.
The version of accelerate is 0.15.0. The version of wandb is 0.13.2.