wandb: Error communicating with backend
Hi,
Sorry for creating this as a new issue, but technically it is. I added a comment to #1287 about this, but since it is a different problem, I thought it would be best to track it in a new issue. Copying the comment from the previous thread:
I am on version 0.10.4 of the client and I hit a similar error, which I’m guessing is a network error. It happened on one of several similar runs.
From what I can tell, it seems something went wrong with login/init. Can the client circumvent this without crashing? Something simple I can think of is just allowing the user to increase the timeout, so the client just keeps polling the backend till it connects, rather than stopping the run. I’m not sure if this has problems I haven’t thought about.
Thanks!
Here’s the stack trace:
wandb: ERROR Error communicating with backend
Traceback (most recent call last):
  File "runners/mme/mme.py", line 64, in <module>
    config=args, reinit=True, project=project)
  File "/home/grad3/samarthm/bitbucket-misc/ssda_mme/utils/ioutils.py", line 398, in init
    wandb.init(*args, **kwargs)
  File "/home/grad3/samarthm/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 460, in init
    run = wi.init()
  File "/home/grad3/samarthm/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 378, in init
    raise UsageError(error_message)
wandb.errors.error.UsageError: Error communicating with backend
About this issue
- State: closed
- Created 4 years ago
- Comments: 48 (14 by maintainers)
Issue-Label Bot is automatically applying the label `bug` to this issue, with a confidence of 0.67.
Hmm I was wrong, I got this error again with 0.10.8
After some debugging, I found that `wandb.init(...)` works, but when I use the `pytorch_lightning.loggers.Wandb(...).experiment` it fails, giving the same `Error communicating with backend`.
This might be related to some hard-coded arguments in the `pytorch_lightning` `wandb.init` call.
I ended up using the following script. One could inherit the logger, but I’ll just initialize the `_experiment` variable in the pl wandb logger.
If it fails, it successfully initializes on the second try.
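The script referred to above is not reproduced here. A minimal sketch of the same retry idea follows, assuming the failure is transient and that a second `wandb.init` call may succeed. The helper name `init_with_retry` and its parameters are hypothetical, not part of wandb or pytorch_lightning.

```python
import time


def init_with_retry(init_fn, attempts=3, delay_s=5.0, retry_on=(Exception,)):
    """Call init_fn() until it succeeds or the attempts run out.

    init_fn is any zero-argument callable, for example
    lambda: wandb.init(project="my-project", reinit=True).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return init_fn()
        except retry_on as err:  # narrow this to wandb's error type if possible
            last_error = err
            print(f"init attempt {attempt}/{attempts} failed: {err}")
            if attempt < attempts:
                time.sleep(delay_s)  # wait before trying again
    raise last_error


# Hypothetical usage (requires wandb to be installed and configured):
# import wandb
# run = init_with_retry(lambda: wandb.init(project="my-project", reinit=True))
```

The returned run could then be handed to the Lightning logger as described above. Since `_experiment` is a private attribute, whether that assignment keeps working depends on the pytorch_lightning version.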
Update: it seems the default `timeout` value for `ret = backend.interface.communicate_run(run, timeout=30)` in `wandb_init.py` is not long enough for certain compute environments. Increasing it to `200` solved the issue for me. As a feature request, I hope the timeout value can be made configurable!
Update: even with the updated rate that comes with a subscription, the problem was not solved.
One thing that I noticed is that when I use sweep, the agent successfully creates a placeholder on wandb sweep and registers the agent itself during `wandb.init`, so it communicates with the backend without a problem.
However, once it makes an experiment placeholder with a random name, it throws this `Error communicating with backend`.
Unfortunately, I had to switch to comet.ml since I can’t run experiments on clusters with wandb.
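On the timeout raised a few comments up: the 0.10.x client discussed in this thread hard-codes `timeout=30`, so the sketch below only applies to newer releases. It assumes the installed wandb version reads the `WANDB_INIT_TIMEOUT` environment variable; check the documentation for your version before relying on it. The helper name `set_init_timeout` is hypothetical.

```python
import os


def set_init_timeout(seconds):
    """Export the init timeout for wandb code that runs after this call.

    Assumption: the installed wandb release honors WANDB_INIT_TIMEOUT.
    """
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    os.environ["WANDB_INIT_TIMEOUT"] = str(seconds)
    return os.environ["WANDB_INIT_TIMEOUT"]


set_init_timeout(200)  # the value reported to work earlier in the thread

# Hypothetical usage (requires wandb to be installed and configured):
# import wandb
# run = wandb.init(project="my-project")
```

The variable has to be set before `wandb.init` is called, since the timeout applies to that call.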
Downgrading to 0.10.8 as @hechmik suggested worked for me.
`pip install --upgrade wandb==0.10.8`
Sorry, it seems like it was failing randomly.
@aChang146 you can set environment variables in a couple ways:
That will make this environment variable available to any processes started from that terminal session.
If you do set the key in your Python code, make sure not to commit that code to source control, as API keys should be treated like secrets and never checked into code or shared with anyone.
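One way to follow this advice from Python is to keep the key in a file outside the repository and load it at startup. This is a rough sketch, assuming the variable in question is `WANDB_API_KEY`. The helper name `load_api_key` and the file path in the usage comment are hypothetical.

```python
import os
from pathlib import Path


def load_api_key(key_file):
    """Put the key from key_file into WANDB_API_KEY unless it is already set."""
    if os.environ.get("WANDB_API_KEY"):
        # Already exported, e.g. in the shell that started this process.
        return os.environ["WANDB_API_KEY"]
    key = Path(key_file).expanduser().read_text().strip()
    if not key:
        raise ValueError(f"{key_file} is empty")
    os.environ["WANDB_API_KEY"] = key
    return key


# Hypothetical usage, with the key file kept outside the repository:
# load_api_key("~/.wandb_api_key")
```

A value already exported in the terminal session takes precedence, so the same script works both on a cluster with a key file and in a shell where the variable is set by hand.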
I ran into this problem as well and the above solution did not help. So I went with @chrischoy’s solution as well.
Can confirm that the issue still persists. Here’s a google colab where I run into the issue. I also found that the error does not show up if I run wandb.init() at the beginning of the colab notebook, but does show up after I have run some significant amount of code above it.
https://drive.google.com/file/d/1ogzN4UTezknK6vlITzHet4JljNGaWqB6/view?usp=sharing
I’m making an assignment for a class I teach at Harvard, and thought it would be a good idea to introduce students to wandb for managing their codebases. But frankly students have been running into issues throughout assignments.
Hey @Arslan-Massod we believe we have a fix for the most common cause of this issue that we will be releasing first thing next week. You can try the branch with that fix now by running:
If running the above version does not work for you, please let us know and provide some example code and information about your environment so we can reproduce.
I ran into a similar issue. In my case, explicitly including `protobuf` in my requirements.txt solved the problem. (Why this was missing, I’m still not sure, but my Python environment was in a weird state, with some packages installed by root and some by the user.)
There were a few clues that the `wandb` backend process was not running on my system:
- Although there was a `debug.log` file (written by the client), there was no `debug-internal.log` file (written by the backend).
- `ps` showed a single process that looked like `/usr/bin/python3 -c from multiprocessing.semaphore_tracker import main;main(64)`, and apparently there should be two such processes.
When I attempted to run `wandb` from the command line, I got a stack trace that helped fix the issue.
Running `pip install --upgrade protobuf` fixed the issue. So I am guessing that when the client tried to spawn the `wandb` process, it got this error as well. (But I’m really not familiar with this code base, so this is just a guess.)
I do see `protobuf` listed as a dependency using `pipdeptree --packages wandb`, so I’m not sure why explicitly including it fixed this issue for me. Again, this might be fallout from the messy Python environment I currently have. So I doubt that this same fix will work for the OP, but it is worth looking at the output of running `wandb` at the command line.
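The checks described above can be roughly automated. The sketch below tests whether `google.protobuf` is importable and counts the client and backend log files. It assumes the logs sit somewhere under a local `wandb` directory, which may not match every setup. The function names are hypothetical.

```python
import importlib.util
from pathlib import Path


def module_available(name):
    """Return True if the module `name` can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. "google") raises instead of returning None.
        return False


def find_logs(root, filename):
    """List files called `filename` anywhere under `root`."""
    root = Path(root)
    return sorted(root.rglob(filename)) if root.is_dir() else []


def diagnose(wandb_dir="wandb"):
    """Summarize the clues: protobuf presence and which log files exist."""
    return {
        "protobuf_importable": module_available("google.protobuf"),
        "client_logs": len(find_logs(wandb_dir, "debug.log")),
        "backend_logs": len(find_logs(wandb_dir, "debug-internal.log")),
    }


print(diagnose())
```

Client logs without any backend logs would match the symptom described in this comment. Running `wandb` at the command line remains the more direct way to see the underlying stack trace.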