wandb: [Q] wandb stream ID error

I tried running wandb together with optuna. Most of the time it works, but sometimes the following error occurs: the code keeps running, but the results are not sent to the wandb website (only system information such as CPU usage gets sent).

Exception in thread SockSrvRdThr:...
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/wandb/sdk/service/server_sock.py", line 112, in run
    shandler(sreq)
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/wandb/sdk/service/server_sock.py", line 174, in server_record_publish
    iface = self._mux.get_stream(stream_id).interface
  File "/home/connectome/dyhan316/.conda/envs/VAE_3DCNN_older_MONAI/lib/python3.8/site-packages/wandb/sdk/service/streams.py", line 206, in get_stream
    stream = self._streams[stream_id]
KeyError: 'zyeausrv'

Given that the error is raised inside wandb, I was wondering whether this is a wandb issue!

(I tried what https://github.com/wandb/wandb/issues/3223#issuecomment-1032820724 suggests, setting os.environ["WANDB_START_METHOD"]="thread", but the same problem occurred.)

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 8
  • Comments: 32 (11 by maintainers)

Most upvoted comments

Artsiom Skarakhod commented: Hi guys! Could you check whether setting this env variable solves the issue?

WANDB_DISABLE_SERVICE=true
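
For anyone who would rather set these from inside the script than from the shell, here is a minimal sketch. Both variable names (WANDB_START_METHOD and WANDB_DISABLE_SERVICE) come from this thread; the helper configure_wandb_env is a hypothetical name of mine, not part of wandb. I am assuming the variables have to be in place before wandb is imported and initialised, and whether they are honoured may depend on the wandb version.

```python
import os


def configure_wandb_env(start_method=None, disable_service=False):
    """Hypothetical helper: set the wandb env variables discussed in this thread.

    Assumption: call this before `import wandb` / `wandb.init()`.
    """
    if start_method is not None:
        os.environ["WANDB_START_METHOD"] = start_method
    if disable_service:
        os.environ["WANDB_DISABLE_SERVICE"] = "true"
    # Return the wandb-related variables so they can be logged or inspected.
    return {k: v for k, v in os.environ.items() if k.startswith("WANDB_")}


settings = configure_wandb_env(start_method="thread", disable_service=True)
print(settings)
```

Printing the returned dict at the top of the training script makes it easy to confirm which workaround was actually active for a given run.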

Running into the same issue.

I have tried the following but the issue persists.

(I tried as https://github.com/wandb/wandb/issues/3223#issuecomment-1032820724 says, by setting os.environ["WANDB_START_METHOD"]="thread" but the same problem occurred)

Here’s the setup I am using.

  1. First, clone the Diffusers repo: git clone https://github.com/huggingface/diffusers.

  2. Then head to the examples folder: cd examples/text_to_image.

  3. Install the dependencies.

  4. Now, run:

     export MODEL_NAME="CompVis/stable-diffusion-v1-4"
     export DATASET_NAME="lambdalabs/pokemon-blip-captions"
    
    
     accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
       --pretrained_model_name_or_path=$MODEL_NAME \
       --dataset_name=$DATASET_NAME --caption_column="text" \
       --resolution=512 --random_flip \
       --train_batch_size=1 \
       --max_train_steps=5 \
       --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
       --seed=42 \
       --validation_prompt="cute creature" \
       --report_to="wandb" \
       --output_dir="sd-pokemon-model-lora" && sudo shutdown now
    

A GPU like a V100 should be sufficient to reproduce the bug.

My problem is solved. Here’s what I found out.

My script was being killed in the background without letting wandb finish the run. Once I ensured that my script was not exiting the process unexpectedly, everything ran smoothly.
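
One way to guard against this kind of early exit is to wrap the training call so the cleanup step always runs. The following is only a rough sketch under my own assumptions, not what the script in this thread does: run_with_cleanup, fake_train and fake_finish are hypothetical names, and the stand-in functions replace the real training loop and wandb so the snippet runs on its own.

```python
import signal
import sys
import threading


def run_with_cleanup(train_fn, finish_fn):
    """Run train_fn and always call finish_fn afterwards, even on errors."""

    def _on_sigterm(signum, frame):
        # Turn SIGTERM into a normal interpreter exit so `finally` still runs.
        sys.exit(143)

    # Signal handlers can only be installed from the main thread.
    in_main = threading.current_thread() is threading.main_thread()
    previous = signal.signal(signal.SIGTERM, _on_sigterm) if in_main else None
    try:
        return train_fn()
    finally:
        finish_fn()
        if in_main:
            signal.signal(signal.SIGTERM, previous)


# Stand-ins for the real training loop and the logger shutdown.
calls = []


def fake_train():
    calls.append("train")
    raise RuntimeError("simulated crash")


def fake_finish():
    calls.append("finish")


try:
    run_with_cleanup(fake_train, fake_finish)
except RuntimeError:
    calls.append("error seen")

print(calls)
```

In a real script one would call wandb.init(...) first and pass wandb.finish as finish_fn. Note that this cannot help if the process is killed with SIGKILL or the machine powers off before the cleanup completes.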