wandb: [CLI]: Can't find port file when using wandb.require("service")

Describe the bug

When running the code snippet below using wandb + PyTorch Lightning on SLURM, the code crashes randomly for most, if not all, runs.

from pytorch_lightning.loggers import WandbLogger
logger = WandbLogger(project="MNIST")

Traceback (most recent call last):
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 996, in init
    wi.setup(kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 133, in setup
    self._wl = wandb_setup.setup()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 318, in setup
    ret = _setup(settings=settings)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 313, in _setup
    wl = _WandbSetup(settings=settings)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 299, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 113, in __init__
    self._setup()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 240, in _setup
    self._setup_manager()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 271, in _setup_manager
    self._manager = wandb_manager._Manager(
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 106, in __init__
    self._service.start()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 104, in start
    self._launch_server()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 100, in _launch_server
    assert ports_found
AssertionError
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 996, in init
    wi.setup(kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 133, in setup
    self._wl = wandb_setup.setup()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 318, in setup
    ret = _setup(settings=settings)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 313, in _setup
    wl = _WandbSetup(settings=settings)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 299, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 113, in __init__
    self._setup()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 240, in _setup
    self._setup_manager()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 271, in _setup_manager
    self._manager = wandb_manager._Manager(
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 106, in __init__
    self._service.start()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 104, in start
    self._launch_server()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 100, in _launch_server
    assert ports_found
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home-mscluster/mfokam/assa/scripts/pretrain_eval.py", line 140, in train_eval
    logger = WandbLogger(
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 311, in __init__
    _ = self.experiment
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
    return get_experiment() or DummyExperiment()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
    return fn(self)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 357, in experiment
    self._experiment = wandb.init(**self._wandb_init)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1037, in init
    raise Exception("problem") from error_seen
Exception: problem
Traceback (most recent call last):
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/__main__.py", line 3, in <module>
    cli.cli(prog_name="python -m wandb")
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/cli/cli.py", line 96, in wrapper
    return func(*args, **kwargs)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/cli/cli.py", line 285, in service
    server.serve()
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/server.py", line 128, in serve
    self._inform_used_ports(grpc_port=grpc_port, sock_port=sock_port)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/server.py", line 65, in _inform_used_ports
    pf.write(self._port_fname)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/port_file.py", line 25, in write
    f = tempfile.NamedTemporaryFile(prefix=bname, dir=dname, mode="w", delete=False)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/tempfile.py", line 540, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/mfokam/tmpel7x5eip/port-21750.txti_81fekt'

Additional Files

No response

Environment

WandB version: 0.12.20

OS: Ubuntu 18.04.6 LTS

Python version: 3.8

Versions of relevant libraries: PyTorch: 1.12.0 PyTorch Lightning: 1.6.4

Additional Context

The code seems to crash when executed on a very slow cluster node.
I tried to reproduce the error locally, but I can only get a similar error if I insert a breakpoint in the tempfile.py method being executed (see the last 6 lines of the stack trace) and wait. After a certain period (3 - 5 seconds), I get an error similar to the one I see on SLURM.

About this issue

State: open
Created 2 years ago
Reactions: 1
Comments: 28 (3 by maintainers)

Commits related to this issue

chore(sdk): add settings and debug for service startup issues (wait_for_ports) (#4749) — committed to wandb/wandb by raubitsj a year ago

Most upvoted comments

Hi, @anmolmann @ArnolFokam, I had the same problem when I ran the code on a very slow cluster node. I found that the error is caused by the magic number 30 in the code below:

https://github.com/wandb/wandb/blob/master/wandb/sdk/service/service.py#L41

def _wait_for_ports(self, fname: str, proc: subprocess.Popen = None) -> bool:
    time_max = time.time() + 30

When I increase the max waiting time from 30 to 300, it works.

+10

ShenghaiRong on Aug 4, 2022
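
The 30-second limit described in the comment above is a deadline-based polling loop: the client waits for the service process to write a port file and gives up once the deadline passes. The following is a minimal sketch of that pattern, not wandb's actual implementation. The function names (`wait_for_port_file`, `service_wait_seconds`) are hypothetical, and reading the timeout from `WANDB__SERVICE_WAIT` mirrors the environment variable mentioned later in this thread.

```python
import os
import time


def _port_file_ready(path: str) -> bool:
    """Return True once the file exists and has some content."""
    try:
        return os.path.getsize(path) > 0
    except OSError:  # file is missing, or vanished between checks
        return False


def service_wait_seconds(default: float = 30.0) -> float:
    """Hypothetical helper: read the timeout from WANDB__SERVICE_WAIT.

    Falls back to `default` when the variable is unset or not a number.
    """
    raw = os.environ.get("WANDB__SERVICE_WAIT", "")
    try:
        return float(raw) if raw else default
    except ValueError:
        return default


def wait_for_port_file(path: str, timeout_s: float, poll_s: float = 0.1) -> bool:
    """Poll for `path` until it is ready or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if _port_file_ready(path):
            return True
        time.sleep(poll_s)
    # One last check so a file written right at the deadline still counts.
    return _port_file_ready(path)
```

On a slow node the writer may simply need longer than the hard-coded limit, which is why raising the timeout makes the failure disappear without changing anything else.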

Hi @kptkin ,

Thank you for this update, it will be very helpful. I am wondering how this variable can be passed as a setting to wandb.init. I tried the following, but it doesn't work:
import wandb
from pytorch_lightning.loggers import WandbLogger
wandb_logger = WandbLogger(settings=wandb.Settings(_service_wait=300))
Also, would it be possible to update the docs?

@gsaltintas thanks for trying it out. A few comments:

Settings that start with _ are private settings that we might remove in the future, so we don't officially document them until we have audited them and convinced ourselves they are necessary. Once we make them public, we will be sure to document them.
Regarding your script, it seems you are correct and we don't pass this setting properly (it is a special case during the setup process of our service; we are working on a fix to make sure the settings are resolved correctly). In the meantime, I would suggest adding it as an environment variable at the beginning of your script to make it work:

import os
import wandb
from pytorch_lightning.loggers import WandbLogger

os.environ["WANDB__SERVICE_WAIT"] = "300"

wandb_logger = WandbLogger()

kptkin on Jan 31, 2023

Hi all!

Thanks for reporting this issue. We are actively working on resolving it and would like to ask for your help.

If you could upgrade to the latest version of wandb (0.13.9 at the time of writing), that would be really helpful.
We also introduced a new environment variable (WANDB__SERVICE_WAIT) that lets you increase the startup wait time from the command line instead of modifying the installed code. To use it, run: WANDB__SERVICE_WAIT=300 python your_script.py
We also added a debug flag that prints timing information during startup to stdout (if you can collect stdout and share it with us, that would further help us narrow down the issue). You can enable it as follows: _WANDB_STARTUP_DEBUG=true python your_script.py

Sorry that you are experiencing this issue; hopefully we can resolve it soon.

kptkin on Jan 26, 2023
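
When the training script is started from a Python launcher rather than a shell, the two variables from the comment above can be passed through the child process environment. This is a rough sketch under that assumption; `startup_env` and `run_script` are hypothetical helper names, and only the variable names `WANDB__SERVICE_WAIT` and `_WANDB_STARTUP_DEBUG` come from this thread.

```python
import os
import subprocess
import sys
from typing import Dict


def startup_env(wait_s: int = 300, debug: bool = False) -> Dict[str, str]:
    """Copy the current environment and add the wandb startup variables."""
    env = dict(os.environ)
    env["WANDB__SERVICE_WAIT"] = str(wait_s)
    if debug:
        # Asks wandb to print startup timing information to stdout.
        env["_WANDB_STARTUP_DEBUG"] = "true"
    return env


def run_script(script: str, wait_s: int = 300, debug: bool = False) -> int:
    """Run `script` with the current interpreter and return its exit code."""
    result = subprocess.run(
        [sys.executable, script],
        env=startup_env(wait_s, debug),
    )
    return result.returncode
```

Calling `run_script("your_script.py", wait_s=300, debug=True)` should be equivalent to prefixing the shell command with both variables.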

Hey, I’m getting the same AssertionError; also running on a cluster and we have hiccups from time to time as well.

I don’t care if logs are only stored locally, or not synchronized right away, but this exception terminates my training altogether.

I would strongly advise re-opening this issue until this is fixed properly! A run being terminated due to this is unacceptable, and possibly reason enough for me to stop using wandb.

sehoffmann on Dec 13, 2022
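
One way to keep a training run alive when logger startup fails, as requested in the comment above, is to wrap logger construction and fall back to a local logger. The sketch below is a user-side workaround and not something wandb or PyTorch Lightning provides; `create_with_fallback` is a hypothetical helper. It catches a bare `Exception` because the traceback in this issue ends in `Exception: problem`.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger(__name__)


def create_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 1,
) -> T:
    """Try `primary` up to `retries` times, then return `fallback()`."""
    for attempt in range(1, retries + 1):
        try:
            return primary()
        except Exception as exc:  # wandb 0.12.20 raises a bare Exception here
            log.warning("logger startup failed (attempt %d/%d): %s",
                        attempt, retries, exc)
    return fallback()
```

Assuming PyTorch Lightning's `CSVLogger` is an acceptable substitute, usage could look like `create_with_fallback(lambda: WandbLogger(project="MNIST"), lambda: CSVLogger("logs"), retries=2)`.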

Hi @nate-wandb

This error occurs when wandb.init is called. Therefore, the wandb folder is not yet created.

In my case, I can confirm that this error occurs when the cluster is very slow. Also, this error is related to wandb.require('service'). I say this because when I downgraded wandb to remove the automatic execution of that snippet on Pytorch Lightning, everything worked fine.

Thanks, Arnol Fokam.

ArnolFokam on Jul 13, 2022

@Adam-lxd , you’re right this is being caused by network slowness for sure. Please let us know if you see that your network speed is back to normal but you still encounter this issue.

Thanks, the main reason is my network; the cluster administrator may be blocking the connection to the wandb service. My current solution is to train on the cluster in offline mode, then switch to my PC to upload the log files.

Adam-lxd on Sep 1, 2022
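
The offline workflow from the comment above can be sketched as two steps: set `WANDB_MODE` to `offline` before training, then upload the saved runs later with the `wandb sync` command. The helper names are hypothetical, and the sketch assumes offline runs are stored in directories named `offline-run-*` under a local `wandb` folder, which may differ between wandb versions.

```python
import os
from pathlib import Path
from typing import List


def enable_offline_mode() -> None:
    """Tell wandb to store runs locally; call this before creating the logger."""
    os.environ["WANDB_MODE"] = "offline"


def sync_commands(wandb_dir: str = "wandb") -> List[List[str]]:
    """Build one `wandb sync <run_dir>` command per offline run directory.

    Assumption: offline runs live in `<wandb_dir>/offline-run-*`.
    """
    runs = sorted(Path(wandb_dir).glob("offline-run-*"))
    return [["wandb", "sync", str(run)] for run in runs if run.is_dir()]
```

Each returned command can be executed with `subprocess.run(cmd, check=True)` on a machine that can reach the wandb servers, after the `wandb` folder has been copied over.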

When I upgrade to pytorch-lightning==1.7.1, I see the same problem. I tried setting os.environ["WANDB_MODE"] = "offline" and offline=True, and even the solution in https://github.com/wandb/wandb/issues/3911#issuecomment-1204961296. But when I downgrade to pytorch-lightning==1.1.8, the code runs without any problem. I guess it may be caused by the network and by differences in WandbLogger in pytorch_lightning.loggers.

Adam-lxd on Aug 16, 2022

Hi, @anmolmann @ArnolFokam, I had the same problem when I ran the code on a very slow cluster node. I found that the error is caused by the magic number 30 in the code below:

https://github.com/wandb/wandb/blob/master/wandb/sdk/service/service.py#L41

def _wait_for_ports(self, fname: str, proc: subprocess.Popen = None) -> bool:
    time_max = time.time() + 30

When I increase the max waiting time from 30 to 300, it works.

I was facing the same issue and changing it to 300 worked for me.

koulanurag on Aug 10, 2022

@anmolmann,

It might be difficult to reproduce the error if you don’t have a cluster node that is slow enough, but I noticed that when I insert a breakpoint (running locally) somewhere inside the constructor called here and wait long enough (5 - 8 seconds), I get a similar error.

Yes.
I did set offline=True after the error occurred but still encountered the same issue.

Hope it helps.

ArnolFokam on Jul 28, 2022