accelerate: `accelerator.end_training()` raises an exception when wandb is used as the tracker

System Info

- `Accelerate` version: 0.15.0
- Platform: macOS-13.1-arm64-i386-64bit
- Python version: 3.9.15
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1 (False)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MPS
        - mixed_precision: bf16
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: None
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no
        - tpu_name: None
        - tpu_zone: None
        - command_file: None
        - commands: None

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I am initializing the accelerator's tracker like this:

    if args.with_tracking:
        experiment_config = vars(args)
        experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"]
        wandb.login(key=os.environ.get("WANDB_API_KEY"))
        accelerator.init_trackers(
            project_name=os.environ.get('WANDB_PROJECT_NAME'),
            config=experiment_config,
            init_kwargs={
                "wandb": {
                    "job_type": "train",
                    "entity": os.environ.get('WANDB_ENTITY_NAME'),
                    "name": get_training_job_name()
                }
            }
        )
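
For context, these calls only reach wandb if the `Accelerator` itself was created with wandb logging enabled. That construction is not shown above; a minimal sketch of what it would look like, with the rest of the setup omitted:

    from accelerate import Accelerator

    # The tracker backend is selected when the Accelerator is built; without
    # log_with="wandb" (or --report_to wandb in the example scripts),
    # init_trackers and end_training never touch wandb at all.
    accelerator = Accelerator(log_with="wandb")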

I am finishing the experiment like this:

    if args.with_tracking:
        accelerator.end_training()
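
For reference, `end_training()` simply walks the active trackers and calls each one's `finish()`; for wandb that boils down to closing the run through wandb's own API. A paraphrased sketch of accelerate's wandb tracker (names simplified, not a verbatim copy of `tracking.py`):

    import wandb

    class WandBTracker:
        def __init__(self, run_name, **kwargs):
            # accelerator.init_trackers() ends up here: one wandb run per job
            self.run = wandb.init(project=run_name, **kwargs)

        def finish(self):
            # accelerator.end_training() calls this; the KeyError below is
            # raised after this returns, in wandb's background service thread.
            self.run.finish()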

The training runs to completion and the wandb run finishes, but at the very end the following exception is thrown:

    Exception in thread SockSrvRdThr:
    Traceback (most recent call last):
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/threading.py", line 980, in _bootstrap_inner
        self.run()
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/server_sock.py", line 112, in run
        shandler(sreq)
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/server_sock.py", line 173, in server_record_publish
        iface = self._mux.get_stream(stream_id).interface
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/streams.py", line 199, in get_stream
        stream = self._streams[stream_id]
    KeyError: '3lxi4eq2'

where the key `3lxi4eq2` is actually the wandb run id.
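
Note that because this KeyError escapes inside wandb's background SockSrvRdThr thread, wrapping `accelerator.end_training()` in try/except in the main thread cannot catch it. Until the underlying race is fixed, one purely cosmetic option is to filter it out with Python 3.8+'s `threading.excepthook`; the sketch below assumes the thread name and exception type shown in the traceback above:

    import threading

    _original_excepthook = threading.excepthook

    def _quiet_wandb_sock_thread(args):
        # Swallow only the exact failure seen above: a KeyError escaping
        # from wandb's socket-server reader thread after the run finished.
        if (
            args.thread is not None
            and args.thread.name == "SockSrvRdThr"
            and issubclass(args.exc_type, KeyError)
        ):
            return
        _original_excepthook(args)

    threading.excepthook = _quiet_wandb_sock_thread

This only hides the message; it does not change when wandb tears down the run's stream.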

Expected behavior

No exception should be thrown by `accelerator.end_training()`.

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 5
  • Comments: 15 (1 by maintainers)

Most upvoted comments

We’ve reached out to the W&B folks; we should have a solution soon!

Hi,

I met the same problem when running the run_glue_no_trainer.py script.

Here is my script.

    export WANDB_API_KEY="xxxx"

    accelerate launch run_glue_no_trainer.py \
      --model_name_or_path bert-base-cased \
      --task_name sst2 \
      --max_length 128 \
      --per_device_train_batch_size 32 \
      --learning_rate 2e-5 \
      --num_train_epochs 1 \
      --output_dir ../checkpoint/sst2 \
      --with_tracking \
      --report_to wandb

I am using accelerate 0.15.0 and wandb 0.13.2.