wandb: [CLI]: Can't find port file when using wandb.require("service")
Describe the bug
When running the code snippet below using wandb + PyTorch Lightning on SLURM, the code crashes randomly for most, if not all, runs.
from pytorch_lightning.loggers import WandbLogger
logger = WandbLogger(project="MNIST")
Traceback (most recent call last):
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 996, in init
wi.setup(kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 133, in setup
self._wl = wandb_setup.setup()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 318, in setup
ret = _setup(settings=settings)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 313, in _setup
wl = _WandbSetup(settings=settings)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 299, in __init__
_WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 113, in __init__
self._setup()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 240, in _setup
self._setup_manager()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 271, in _setup_manager
self._manager = wandb_manager._Manager(
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 106, in __init__
self._service.start()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 104, in start
self._launch_server()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 100, in _launch_server
assert ports_found
AssertionError
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 996, in init
wi.setup(kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 133, in setup
self._wl = wandb_setup.setup()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 318, in setup
ret = _setup(settings=settings)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 313, in _setup
wl = _WandbSetup(settings=settings)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 299, in __init__
_WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 113, in __init__
self._setup()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 240, in _setup
self._setup_manager()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_setup.py", line 271, in _setup_manager
self._manager = wandb_manager._Manager(
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_manager.py", line 106, in __init__
self._service.start()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 104, in start
self._launch_server()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/service.py", line 100, in _launch_server
assert ports_found
AssertionError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home-mscluster/mfokam/assa/scripts/pretrain_eval.py", line 140, in train_eval
logger = WandbLogger(
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 311, in __init__
_ = self.experiment
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
return get_experiment() or DummyExperiment()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
return fn(*args, **kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
return fn(self)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 357, in experiment
self._experiment = wandb.init(**self._wandb_init)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1037, in init
raise Exception("problem") from error_seen
Exception: problem
Traceback (most recent call last):
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/__main__.py", line 3, in <module>
cli.cli(prog_name="python -m wandb")
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/cli/cli.py", line 96, in wrapper
return func(*args, **kwargs)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/cli/cli.py", line 285, in service
server.serve()
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/server.py", line 128, in serve
self._inform_used_ports(grpc_port=grpc_port, sock_port=sock_port)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/server.py", line 65, in _inform_used_ports
pf.write(self._port_fname)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/site-packages/wandb/sdk/service/port_file.py", line 25, in write
f = tempfile.NamedTemporaryFile(prefix=bname, dir=dname, mode="w", delete=False)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/tempfile.py", line 540, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/home-mscluster/mfokam/anaconda3/envs/assa/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/mfokam/tmpel7x5eip/port-21750.txti_81fekt'
Additional Files
No response
Environment
WandB version: 0.12.20
OS: Ubuntu 18.04.6 LTS
Python version: 3.8
Versions of relevant libraries: PyTorch: 1.12.0, PyTorch Lightning: 1.6.4
Additional Context
- The code seems to crash when executed on a very slow cluster node.
- I tried to reproduce the error locally, but I can only get a similar error if I insert a breakpoint in the tempfile.py method being executed (see the last 6 lines of the stack trace) and wait. After a certain period (3 - 5 seconds), I get an error similar to the one I see on SLURM.
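The timing behaviour described above can be illustrated with a minimal sketch. This is not wandb's code: `wait_for_file`, `slow_writer`, and `simulate` are hypothetical names, and the sketch only assumes that a parent process polls for a port file for a fixed window while a slower process writes it.

```python
import os
import tempfile
import threading
import time


def wait_for_file(path: str, timeout: float, poll_interval: float = 0.05) -> bool:
    """Poll until `path` exists or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    # One last check after the deadline has passed.
    return os.path.exists(path)


def slow_writer(path: str, delay: float) -> None:
    """Hypothetical stand-in for a service that is slow to publish its port."""
    time.sleep(delay)
    with open(path, "w") as f:
        f.write("sock=12345\n")


def simulate(delay: float, timeout: float) -> bool:
    """Return True if the port file appears before the waiter gives up."""
    with tempfile.TemporaryDirectory() as tmp:
        port_file = os.path.join(tmp, "port.txt")
        writer = threading.Thread(target=slow_writer, args=(port_file, delay))
        writer.start()
        found = wait_for_file(port_file, timeout)
        writer.join()  # let the writer finish before the directory is removed
        return found


if __name__ == "__main__":
    print("slow writer, short wait:", simulate(delay=0.5, timeout=0.1))
    print("slow writer, long wait:", simulate(delay=0.5, timeout=3.0))
```

Under this reading, a slow node means the port file appears after the waiting window has closed, so the waiter gives up. If the temporary directory were also removed at that point, a late write would fail with a FileNotFoundError like the one in the last traceback; that is an inference from the traces, not something confirmed in this thread.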
About this issue
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 28 (3 by maintainers)
Commits related to this issue
- chore(sdk): add settings and debug for service startup issues (wait_for_ports) (#4749) — committed to wandb/wandb by raubitsj a year ago
Hi @anmolmann, @ArnolFokam, I had the same problem when I ran the code on a very slow cluster node. I found that the error is caused by the magic number 30 in the code below: https://github.com/wandb/wandb/blob/master/wandb/sdk/service/service.py#L41
def _wait_for_ports(self, fname: str, proc: subprocess.Popen = None) -> bool:
    time_max = time.time() + 30
When I increase the max waiting time from 30 to 300, it works.
@gsaltintas thanks for trying it out. A few comments:
Hi all!
Thanks for reporting this issue. We are actively working on resolving it and would like to ask for your help.
WANDB__SERVICE_WAIT) this will allow you to increase the startup time from the command line instead of reaching into and modifying the installed code. To use it, just do the following:
```
WANDB__SERVICE_WAIT=300 python your_script.py
```
`_WANDB_STARTUP_DEBUG=true python your_script.py`
Sorry that you are experiencing this issue; hopefully we can resolve it soon.
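For jobs launched through a scheduler, where editing the command line is awkward, the same `WANDB__SERVICE_WAIT` variable can presumably be set from inside the script. The sketch below assumes that the installed wandb version supports this variable and that it is read when the service starts, so it has to be set before `wandb.init` or `WandbLogger(...)` runs. `configure_service_wait` is a hypothetical helper name.

```python
import os


def configure_service_wait(seconds: int = 300) -> None:
    """Set the service startup wait unless the caller's environment already did.

    Assumption: the installed wandb version honours WANDB__SERVICE_WAIT.
    """
    # setdefault keeps any value exported in the job script or shell.
    os.environ.setdefault("WANDB__SERVICE_WAIT", str(seconds))


# Call this before wandb.init(...) or WandbLogger(...) is executed.
configure_service_wait(300)
```

Using `setdefault` rather than plain assignment is a design choice: a value passed on the command line still takes precedence over the one hard-coded in the script.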
Hey, I’m getting the same AssertionError; also running on a cluster and we have hiccups from time to time as well.
I don’t care if logs are only stored locally, or not synchronized right away, but this exception terminates my training altogether.
I would strongly advise re-opening this issue until this is fixed properly! A run being terminated because of this is unacceptable, and possibly reason enough for me to stop using wandb.
Hi @nate-wandb
This error occurs when `wandb.init` is called. Therefore, the wandb folder is not yet created.
In my case, I can confirm that this error occurs when the cluster is very slow. Also, this error is related to `wandb.require('service')`. I say this because when I downgraded wandb to remove the automatic execution of that snippet in PyTorch Lightning, everything worked fine.
Thanks, Arnol Fokam.
Thanks. The main cause in my case is my network: the cluster administrator may be blocking the connection to the wandb service. My current solution is to train on the cluster in offline mode, then switch to my PC to upload the log files.
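A minimal sketch of that offline workflow, assuming the `WANDB_MODE` environment variable is read when the run is initialised. `use_offline_mode` is a hypothetical helper name, and other comments in this thread report that offline mode did not avoid the error for them.

```python
import os


def use_offline_mode() -> None:
    """Ask wandb to store runs locally instead of contacting the server."""
    os.environ["WANDB_MODE"] = "offline"


# Call this before the logger is created, for example before
# WandbLogger(project="MNIST", offline=True).
use_offline_mode()

# Later, from a machine with network access, the stored run can be
# uploaded with the CLI:
#   wandb sync <path-to-offline-run-directory>
```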
When I upgrade to pytorch-lightning==1.7.1, I see the same problem. I tried setting `os.environ["WANDB_MODE"] = "offline"` and `offline=True`, and even the solution in https://github.com/wandb/wandb/issues/3911#issuecomment-1204961296, without success. But when I downgrade to pytorch-lightning==1.1.8, the code runs without any problem. I guess it may be caused by the network and by differences in WandbLogger in pytorch_lightning.loggers.
I was facing the same issue and changing it to 300 worked for me.
@anmolmann,
It might be difficult to reproduce the error if you don’t have a cluster node that is slow enough, but I noticed that when I insert a breakpoint (running locally) somewhere inside the constructor called here and wait long enough (5 - 8 seconds), I get a similar error.
`offline=True` after the error occurred but still encountered the same issue.
Hope it helps.