wandb: [CLI] Sometimes FileNotFoundError randomly crashes runs
Description
I’m using Windows + PyTorch Lightning + Hydra. Sometimes runs just crash with:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\X\\AppData\\Local\\Temp\\tmpgv0qht61wandb-media\\ai5vmd11.graph.json'
This happens after the run has already been successfully initialized.
Full stack trace:
Training: 0it [00:00, ?it/s]
Training: 0%| | 0/1158 [00:00<?, ?it/s]
Epoch 0: 0%| | 0/1158 [00:00<?, ?it/s]
Traceback (most recent call last):
File ".\train.py", line 77, in main
metric = train(config)
File ".\train.py", line 56, in train
trainer.fit(model=model, datamodule=datamodule)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 510, in fit
results = self.accelerator_backend.train()
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 57, in train
return self.train_or_test()
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 74, in train_or_test
results = self.trainer.train()
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 561, in train
self.train_loop.run_training_epoch()
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 550, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 718, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 485, in optimizer_step
model_ref.optimizer_step(
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\core\lightning.py", line 1298, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\core\optimizer.py", line 286, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\core\optimizer.py", line 144, in __optimizer_step
optimizer.step(closure=closure, *args, **kwargs)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\torch\autograd\grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\torch\optim\adam.py", line 66, in step
loss = closure()
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 708, in train_step_and_backward_closure
result = self.training_step_and_backward(
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 816, in training_step_and_backward
self.backward(result, optimizer, opt_idx)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 842, in backward
result.closure_loss = self.trainer.accelerator_backend.backward(
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 109, in backward
model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\core\lightning.py", line 1162, in backward
loss.backward(*args, **kwargs)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\torch\tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\wandb_torch.py", line 398, in backward_hook
wandb.run.summary["graph_%i" % graph_idx] = graph
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\wandb_summary.py", line 57, in __setitem__
self.update({key: val})
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\wandb_summary.py", line 79, in update
self._update(record)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\wandb_summary.py", line 133, in _update
self._update_callback(record)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\wandb_run.py", line 670, in _summary_update_callback
self._backend.interface.publish_summary(summary_record)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\interface\interface.py", line 538, in publish_summary
pb_summary_record = self._make_summary(summary_record)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\interface\interface.py", line 333, in _make_summary
json_value = self._summary_encode(item.value, path_from_root)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\interface\interface.py", line 296, in _summary_encode
data_types.val_to_json(
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\data_types.py", line 2948, in val_to_json
val.bind_to_run(run, key, namespace)
File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\data_types.py", line 2510, in bind_to_run
util.json_dump_safer(data, codecs.open(tmp_path, "w", encoding="utf-8"))
File "C:\Users\X\Anaconda3\envs\graphs\lib\codecs.py", line 905, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\X\\AppData\\Local\\Temp\\tmpgv0qht61wandb-media\\ai5vmd11.graph.json'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Wandb features
PyTorch Lightning WandbLogger.
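For reference, the logger is wired up roughly like this (a simplified reconstruction; the project name and Trainer arguments are illustrative):

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger


def build_trainer(model: pl.LightningModule) -> pl.Trainer:
    logger = WandbLogger(project="graphs")  # project name is illustrative
    # watch() is what installs the backward hook seen in the trace above
    # (wandb_torch.backward_hook), which dumps the model graph to a
    # temporary *.graph.json file under a "...wandb-media" temp folder.
    logger.watch(model)
    return pl.Trainer(logger=logger, max_epochs=1)
```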
How to reproduce
Unfortunately this seems to happen completely randomly. Most of the time runs behave as expected, but on average 1 in 30 runs crashes, which is bothersome as it takes down my whole multiruns. It has happened many times with different setups; there is no easy way to reproduce it.
Environment
- OS: Windows 10
- Environment: conda
- Python Version: 3.8.6
Any idea what might cause it or how to prevent it?
About this issue
- State: closed
- Created 3 years ago
- Reactions: 7
- Comments: 24 (7 by maintainers)
Running into a similar issue when logging a wandb.Table object; my run sometimes crashes and sometimes not. I'm using pytorch-lightning, hydra, and wandb (all latest versions) on a SLURM cluster.

My running theory is that, because I am using SLURM, the /tmp/ directory may be cleared by another user or process on the node, so the temporary folder where the table.json was supposed to be saved no longer exists.

I can reproduce this error by deleting the temporary media folder created by wandb, tmpksnn7j_ywandb-media (in this case), while the code is running.
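Something along these lines triggers it for me (the project name is made up, and the deletion happens from the same script here purely to simulate the external cleanup):

```python
import glob
import os
import shutil
import tempfile

import wandb

run = wandb.init(project="tmp-repro")  # project name is made up
for step in range(100):
    run.log({"my_table": wandb.Table(columns=["step"], data=[[step]])})
    if step == 50:
        # Simulate another user/process clearing /tmp: delete the
        # "*wandb-media" temp folders that wandb creates for media files.
        for d in glob.glob(os.path.join(tempfile.gettempdir(), "*wandb-media")):
            shutil.rmtree(d, ignore_errors=True)
run.finish()
```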
Hi - I am having a similar error when I use multiprocessing to spin up multiple agents for wandb sweeps. The error looks something like FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpqk8yfsrewandb-media/22jc9qcw.png' and it halts my runs.
Would be awesome to have this fixed. I have a strong preference for using the multiprocessing module over other approaches to parallelizing.
I am also running into this issue when using hydra, wandb, and multiprocessing. An outline of hydra_main.py:
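Roughly the following; the worker body and the n_workers config field are placeholders:

```python
import multiprocessing as mp

import hydra
import wandb  # imported at module level; wandb.init() is never called
from omegaconf import DictConfig


def worker(seed: int) -> None:
    # Placeholder for the real per-process job; it never touches wandb.
    print(f"worker {seed} done")


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    procs = [mp.Process(target=worker, args=(s,)) for s in range(cfg.n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


if __name__ == "__main__":
    main()
```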
Note that this example is not even calling wandb.init(); wandb is only imported at the top of the file. However, running a variant of this script does not cause this exception.
Hi there, sorry to hear about these issues. Does anyone on this thread have a minimal code sample that reproduces this issue? We're a little stumped – it definitely looks like Python's tempfile module (which is what we use under the hood) is failing here.
One suspicion: are any of these examples making use of multiprocessing? I think I've read that tempfiles can misbehave when used from a process other than the one that originally created the tempfile/tempdir.
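For what it's worth, here is a contrived illustration of the kind of cross-process staleness I mean: a temp directory created in one process can already be gone when another process uses the path (toy code, not what wandb actually does):

```python
import multiprocessing as mp
import os
import tempfile


def make_tmpdir(_arg: object) -> str:
    # The TemporaryDirectory lives in the worker process; leaving the
    # `with` block deletes the directory on disk, so the path handed
    # back to the parent is already stale.
    with tempfile.TemporaryDirectory() as d:
        return d


if __name__ == "__main__":
    with mp.Pool(processes=1) as pool:
        path = pool.apply(make_tmpdir, (None,))
    print(path, "exists:", os.path.exists(path))  # exists: False
```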
I am having a similar issue – after a while the process fails with the error message above, although incomplete logs are still saved and synchronized. I am not using multiprocessing, and I placed import wandb right before calling wandb.init() as suggested, but it didn't prevent the error from happening.

Same issue here. However, I am not running multiple processes. The problem indeed occurs when logging the table. For some runs it works, for others it doesn't. Totally random. I am using the most recent wandb version.
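Given the theory above that something clears /tmp, one workaround I plan to try (untested, and the cache path is arbitrary) is pointing Python's tempfile at a private directory before wandb is imported:

```python
import os
import tempfile

# Untested workaround: use a private temp dir that no cleanup job touches.
os.environ["TMPDIR"] = os.path.expanduser("~/.cache/wandb-tmp")
os.makedirs(os.environ["TMPDIR"], exist_ok=True)
tempfile.tempdir = None  # force tempfile to re-read TMPDIR on next use

import wandb  # noqa: E402 - wandb's temp media folders now use the dir above
```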