wandb: Error communicating with backend

Hi,

Sorry for creating this as a new issue, but technically it is. I added a comment to #1287 about this, but since it is a different problem, I thought it would be best to track it in a new issue. Copying the comment from the previous thread:

I am on version 0.10.4 of the client and hit a similar error, which I’m guessing is a network error. It happened on one of several similar runs.

From what I can tell, it seems something went wrong with login/init. Can the client circumvent this without crashing? Something simple I can think of is just allowing the user to increase the timeout, so the client just keeps polling the backend till it connects, rather than stopping the run. I’m not sure if this has problems I haven’t thought about.

Thanks!

Here’s the stack trace:

wandb: ERROR Error communicating with backend
Traceback (most recent call last):
  File "runners/mme/mme.py", line 64, in <module>
    config=args, reinit=True, project=project)
  File "/home/grad3/samarthm/bitbucket-misc/ssda_mme/utils/ioutils.py", line 398, in init
    wandb.init(*args, **kwargs)
  File "/home/grad3/samarthm/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 460, in init
    run = wi.init()
  File "/home/grad3/samarthm/anaconda3/envs/pytorch3conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 378, in init
    raise UsageError(error_message)
wandb.errors.error.UsageError: Error communicating with backend

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 48 (14 by maintainers)

Most upvoted comments


Hmm I was wrong, I got this error again with 0.10.8

2020-11-07 01:04:34,387 INFO    MainThread:65675 [internal.py:wandb_internal():62] W&B internal server running at pid: 65675
2020-11-07 01:04:34,389 INFO    WriterThread:65675 [datastore.py:open_for_write():76] open: wandb/run-20201107_010319-17yifxe3/run-17yifxe3.wandb
2020-11-07 01:04:34,390 DEBUG   SenderThread:65675 [sender.py:send():89] send: header
2020-11-07 01:04:34,390 DEBUG   HandlerThread:65675 [handler.py:handle_request():54] handle_request: check_version
2020-11-07 01:04:34,391 DEBUG   HandlerThread:65675 [handler.py:handle_request():54] handle_request: shutdown
2020-11-07 01:04:34,392 DEBUG   SenderThread:65675 [sender.py:send():89] send: request
2020-11-07 01:04:34,392 DEBUG   SenderThread:65675 [sender.py:send_request():98] send_request: check_version
2020-11-07 01:04:34,392 INFO    HandlerThread:65675 [handler.py:finish():267] shutting down handler
2020-11-07 01:04:34,398 DEBUG   Thread-4  :65675 [connectionpool.py:_new_conn():939] Starting new HTTPS connection (1): pypi.org:443
2020-11-07 01:04:34,446 DEBUG   Thread-4  :65675 [connectionpool.py:_make_request():433] https://pypi.org:443 "GET /pypi/wandb/json HTTP/1.1" 200 51383
2020-11-07 01:04:34,457 INFO    SenderThread:65675 [sender.py:finish():608] shutting down sender
2020-11-07 01:04:35,392 INFO    WriterThread:65675 [datastore.py:close():257] close: wandb/run-20201107_010319-17yifxe3/run-17yifxe3.wandb
2020-11-07 01:04:35,393 INFO    MainThread:65675 [internal.py:handle_exit():137] Internal process exited
2020-11-07 01:03:19,899 INFO    MainThread:65105 [wandb_init.py:_log_setup():293] Logging user logs to wandb/run-20201107_010319-17yifxe3/logs/debug.log
2020-11-07 01:03:19,899 INFO    MainThread:65105 [wandb_init.py:_log_setup():294] Logging internal logs to wandb/run-20201107_010319-17yifxe3/logs/debug-internal.log
2020-11-07 01:03:19,899 INFO    MainThread:65105 [wandb_setup.py:_flush():69] setting env: {}
2020-11-07 01:03:19,899 INFO    MainThread:65105 [wandb_setup.py:_flush():69] setting user settings: {'save_code': False, 'email': '--------@gmail.com'}
2020-11-07 01:03:19,899 INFO    MainThread:65105 [wandb_setup.py:_flush():69] multiprocessing start_methods=fork,spawn,forkserver
2020-11-07 01:04:35,457 INFO    MainThread:65105 [wandb_init.py:teardown():154] tearing down wandb.init

After some debugging, I found that

wandb.init(...) works, but when I use pytorch_lightning.loggers.WandbLogger(...).experiment it fails with the same Error communicating with backend.

This might be related to some hard-coded arguments in the pytorch_lightning wandb.init call.

import pytorch_lightning as pl
import wandb

db = pl.loggers.WandbLogger(name='new_test', project='test')
# Following line sometimes fails
print(db.experiment)

# Copy of https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/loggers/wandb.py#L125
# It also fails sometimes
wandb.init(name=db._name, dir=db._save_dir, project=db._project, anonymous=db._anonymous, reinit=True, id=db._id, resume='allow', **db._kwargs)

I ended up using the following script. One could subclass the logger, but I just set the _experiment attribute of the PL WandbLogger directly.

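import time

import pytorch_lightning as pl
import wandb

logger = pl.loggers.WandbLogger(name='new_test', project='test')
while True:
    try:
        # Initialize the run ourselves and hand it to the logger
        logger._experiment = wandb.init(name=logger._name, project=logger._project)
        break
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("Retrying")
        time.sleep(10)

....
# works fine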

If it fails, it successfully initializes on the second try.
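
For anyone who wants the same workaround without an unbounded loop, here is a minimal sketch of a bounded retry with a doubling delay. The helper name retry_call and its parameters are made up for illustration; they are not part of wandb or pytorch_lightning.

```python
import time


def retry_call(fn, attempts=5, delay=10.0, exceptions=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds or `attempts` calls have failed.

    Hypothetical helper: the last error is re-raised so the caller still
    sees the original failure once the retries are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions as err:
            if attempt == attempts:
                raise  # out of retries: surface the original exception
            print(f"Attempt {attempt} failed ({err!r}); retrying in {delay}s")
            sleep(delay)
            delay *= 2  # wait twice as long before the next attempt
```

With the logger from the script above, this would be used as logger._experiment = retry_call(lambda: wandb.init(name=logger._name, project=logger._project)).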

Update: it seems the default timeout in ret = backend.interface.communicate_run(run, timeout=30) in wandb_init.py is not long enough for certain compute environments. Increasing it to 200 solved the issue for me. As a feature request, I hope the timeout value can be made configurable!
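
A note for later readers: as far as I know, newer releases of the client read an init timeout from the WANDB_INIT_TIMEOUT environment variable (in seconds), so editing wandb_init.py should no longer be needed there. This is an assumption about newer versions only; the 0.10.x releases discussed in this thread do not appear to have it. A minimal sketch:

```python
import os

# Assumption: WANDB_INIT_TIMEOUT (seconds) is honored by newer wandb releases;
# the 0.10.x versions discussed in this thread do not read it.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("WANDB_INIT_TIMEOUT", "200")
```

The variable has to be set before wandb.init is called for it to have any effect.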

Update: even upgrading to a subscription with a higher rate limit did not solve the problem.

One thing I noticed is that when I use a sweep, the agent successfully creates a placeholder on the wandb sweep and registers itself during wandb.init, so it communicates with the backend without a problem.

However, once it creates an experiment placeholder with a random name, it throws this Error communicating with backend.

Unfortunately, I had to switch to comet.ml since I can’t run experiments on clusters with wandb.

Downgrading to 0.10.8 as @hechmik suggested worked for me.

pip install --upgrade wandb==0.10.8


Sorry, it seems like it was failing randomly.

@aChang146 you can set environment variables in a couple ways:

  1. From your terminal
export WANDB_API_KEY=XXXXXXX

That will make this environment variable available to any processes started from that terminal session.

  2. From within Python
import os
os.environ["WANDB_API_KEY"] = "XXXXXXX"

If you do set the key in your Python code, make sure not to commit that code to source control: API keys should be treated like secrets and never checked into code or shared with anyone.
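
One way to keep the key out of the code entirely is to export it in the terminal (option 1) and have the script only check that it is present. A minimal sketch; the helper name require_api_key is hypothetical and not part of wandb:

```python
import os


def require_api_key(env=os.environ, name="WANDB_API_KEY"):
    """Return the API key from the environment, or fail with a clear message.

    Hypothetical helper: the key is read at runtime, so it never has to
    appear in a file that could be committed.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it in your shell first")
    return key
```

Calling require_api_key() at the top of a training script makes a missing key fail immediately with a readable message rather than later during login.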

WANDB_START_METHOD=thread python train.py

Or directly in python with:

import os
os.environ["WANDB_START_METHOD"] = "thread"

I ran into this problem as well and the above solution did not help. So I went with @chrischoy’s solution as well.

Can confirm that the issue still persists. Here’s a Google Colab notebook where I run into it. I also found that the error does not show up if I run wandb.init() at the beginning of the notebook, but does show up after I have run a significant amount of code above it.

https://drive.google.com/file/d/1ogzN4UTezknK6vlITzHet4JljNGaWqB6/view?usp=sharing

I’m making an assignment for a class I teach at Harvard, and thought it would be a good idea to introduce students to wandb for managing their codebases. But frankly students have been running into issues throughout assignments.

Hey @Arslan-Massod we believe we have a fix for the most common cause of this issue that we will be releasing first thing next week. You can try the branch with that fix now by running:

pip install --upgrade git+git://github.com/wandb/client.git@task/debug-init-wandb#egg=wandb

If running the above version does not work for you, please let us know and provide some example code and information about your environment so we can reproduce.

I ran into a similar issue. In my case, explicitly including protobuf in my requirements.txt solved the problem. (Why this was missing, I’m still not sure, but my python environment was in a weird state, some packages installed by root, some by the user.)

There were a few clues that the wandb backend process was not running on my system.

  • While there was a debug.log file (written by the client) there was no debug-internal.log file (written by the backend).
  • ps showed a single process that looked like /usr/bin/python3 -c from multiprocessing.semaphore_tracker import main;main(64) and apparently there should be two such processes.
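
The first clue can be checked with a few lines of Python. This is only a sketch: it assumes the wandb/run-<timestamp>-<id>/logs/ layout seen in the log excerpts earlier in this thread, and the function name check_run_logs is made up for illustration.

```python
from pathlib import Path


def check_run_logs(wandb_dir="wandb"):
    """Report which log files exist for the most recent run directory.

    Assumes the wandb/run-<timestamp>-<id>/logs/ layout shown in the logs
    above. Returns None when no run directory is found.
    """
    runs = sorted(p for p in Path(wandb_dir).glob("run-*") if p.is_dir())
    if not runs:
        return None
    logs = runs[-1] / "logs"  # names sort chronologically by timestamp
    return {
        name: (logs / name).is_file()
        for name in ("debug.log", "debug-internal.log")
    }
```

If debug.log is reported as present and debug-internal.log as missing, that matches the situation described above, where the internal process never started.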

When I attempted to run wandb from the command line I got a stacktrace that helped fix the issue.

$ wandb
Traceback (most recent call last):
  File "/usr/local/bin/wandb", line 5, in <module>
    from wandb.cli.cli import cli
  File "/home/build/.local/lib/python3.6/site-packages/wandb/__init__.py", line 37, in <module>
    from wandb import sdk as wandb_sdk
  File "/home/build/.local/lib/python3.6/site-packages/wandb/sdk/__init__.py", line 12, in <module>
    from .wandb_init import init  # noqa: F401
  File "/home/build/.local/lib/python3.6/site-packages/wandb/sdk/wandb_init.py", line 28, in <module>
    from .backend.backend import Backend
  File "/home/build/.local/lib/python3.6/site-packages/wandb/sdk/backend/backend.py", line 14, in <module>
    from ..interface import interface
  File "/home/build/.local/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 18, in <module>
    from wandb.proto import wandb_internal_pb2  # type: ignore
  File "/home/build/.local/lib/python3.6/site-packages/wandb/proto/wandb_internal_pb2.py", line 5, in <module>
    from google.protobuf import descriptor as _descriptor
ModuleNotFoundError: No module named 'google.protobuf'

Running pip install --upgrade protobuf fixed the issue. So I am guessing that when the client tried to spawn the wandb process, it got this error as well. (But I’m really not familiar with this code base, so this is just a guess.)

I do see protobuf listed as a dependency using pipdeptree --packages wandb, so I’m not sure why explicitly including it fixed this issue for me. Again, this might be fallout from the messy Python environment I currently have. I doubt this same fix will work for the OP, but it is worth looking at the output of running wandb at the command line.
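
A quick way to run the same check from inside the interpreter that runs the training script (which may not be the one behind the wandb command) is to ask Python whether the modules can be found. A minimal sketch; is_importable is a hypothetical helper name:

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate module_name in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError when a parent package
        # (e.g. "google" for "google.protobuf") is itself missing.
        return False


for name in ("google.protobuf", "wandb"):
    print(name, "importable:", is_importable(name))
```

If google.protobuf is reported as not importable here, the ModuleNotFoundError from the traceback above is the likely reason the internal process fails to start.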