wandb: Intermittent socket timeouts

This happens rarely but we should likely catch the timeout.

File "/Midgard/home/mrabadan/anaconda3/envs/pytorch/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Midgard/home/mrabadan/anaconda3/envs/pytorch/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "scripts/train_mmnist.py", line 32, in train_run_worker
dir=run_dir)
File "/Midgard/home/mrabadan/anaconda3/envs/pytorch/lib/python3.7/site-packages/wandb/__init__.py", line 983, in init
_init_headless(run)
File "/Midgard/home/mrabadan/anaconda3/envs/pytorch/lib/python3.7/site-packages/wandb/__init__.py", line 239, in _init_headless
success, message = server.listen(30)
File "/Midgard/home/mrabadan/anaconda3/envs/pytorch/lib/python3.7/site-packages/wandb/wandb_socket.py", line 46, in listen
self.connect()
File "/Midgard/home/mrabadan/anaconda3/envs/pytorch/lib/python3.7/site-packages/wandb/wandb_socket.py", line 40, in connect
self.connection, addr = self.socket.accept()
File "/Midgard/home/mrabadan/anaconda3/envs/pytorch/lib/python3.7/socket.py", line 212, in accept
fd, addr = self._accept()
socket.timeout: timed out
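
Until the timeout is handled inside the library, one way to catch it in user code is to wrap the failing call in a small retry helper. The sketch below is only an illustration: `call_with_retries`, `attempts`, and `delay_s` are hypothetical names that do not exist in wandb, and whether calling `wandb.init` a second time after a failed attempt behaves cleanly on this version is an assumption that has not been verified.

```python
import socket
import time


def call_with_retries(fn, attempts=3, delay_s=5.0):
    """Call fn() and return its result, retrying when it raises socket.timeout.

    Hypothetical helper for illustration; it is not part of wandb.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except socket.timeout:
            # Out of retries: re-raise so the caller still sees the failure.
            if attempt == attempts:
                raise
            # Wait before the next attempt to give the backend time to start.
            time.sleep(delay_s)
```

In a training script this could be used as `run = call_with_retries(lambda: wandb.init(project="my-project", dir=run_dir))`, where the project name is a placeholder. Only `socket.timeout` is retried, so any other error from the wrapped call still fails immediately.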

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 8
  • Comments: 26 (8 by maintainers)

Most upvoted comments

Hi, I just started using wandb and I am also getting this error when running remote jobs (qsub), and it is not rare at all: around 1/3 of my jobs end like that. I am not sure what is happening, but in conjunction with other bugs (https://github.com/wandb/client/issues/785) it seems like wandb is not a viable option for me.

wandb: Tracking run with wandb version 0.8.21
Traceback (most recent call last):
  File "train_network.py", line 18, in <module>
    wandb.init(project=args.project, name=args.experiment_name, config=vars(args))
  File "/rds/general/user/ef1015/home/anaconda3/envs/cuda10/lib/python3.7/site-packages/wandb/__init__.py", line 1075, in init
    _init_headless(run)
  File "/rds/general/user/ef1015/home/anaconda3/envs/cuda10/lib/python3.7/site-packages/wandb/__init__.py", line 277, in _init_headless
    success, _ = server.listen(30)
  File "/rds/general/user/ef1015/home/anaconda3/envs/cuda10/lib/python3.7/site-packages/wandb/wandb_socket.py", line 46, in listen
    self.connect()
  File "/rds/general/user/ef1015/home/anaconda3/envs/cuda10/lib/python3.7/site-packages/wandb/wandb_socket.py", line 40, in connect
    self.connection, addr = self.socket.accept()
  File "/rds/general/user/ef1015/home/anaconda3/envs/cuda10/lib/python3.7/socket.py", line 212, in accept
    fd, addr = self._accept()
socket.timeout: timed out

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.99. Please mark this comment with 👍 or 👎 to give our bot feedback!

I got the exact same issue. I guess it is caused by the remote server spending time assigning tasks, which in my case takes nearly 5 minutes or even longer. Maybe this is why wandb returns a timeout error. Since I use the dry run mode of wandb, I just modified line 239 of __init__.py under site-packages/wandb as follows:

    # success, _ = server.listen(30)
    success = True

Then everything works well. It's a temporary solution for those who want to use it now. @Yumin-Sun-00 Hoping the issue can be solved as soon as possible.

I got the same error today. This is already the second project I have run with wandb.

I run the project both locally and on a remote server. Locally it is fine; remotely it hits the timeout issue.

I think the problem could be that the remote machine spends time assigning tasks, which takes nearly 2 minutes or even longer. Maybe this is why wandb returns a timeout error. Could you help solve this?

@vanpelt Hi, I think I know what my problem is. On my compute nodes the internet connection is blocked, and my code calls wandb.restore to fetch files from the wandb server. I think this is my fault. Thanks for your concern.
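
For anyone who suspects the same cause, a quick check from inside the job can show whether the compute node has outbound network access at all. This is a minimal sketch using only the Python standard library; `can_reach` is a hypothetical helper name, and the host to test (for example `api.wandb.ai` on port 443) is an assumption about which endpoint your setup talks to.

```python
import socket


def can_reach(host, port=443, timeout_s=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout_s.

    Hypothetical helper for illustration; it only tests basic connectivity.
    """
    try:
        # create_connection resolves the host name and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

Calling `can_reach("api.wandb.ai")` at the start of a job and logging the result makes it easier to tell a blocked network apart from a slow scheduler. A successful TCP connection does not prove that wandb itself will work, only that the node is not fully cut off.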