wandb: [CLI] Sometimes FileNotFoundError randomly crashes runs

Description

I’m using Windows + PyTorch Lightning + Hydra. Sometimes runs just crash with:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\X\\AppData\\Local\\Temp\\tmpgv0qht61wandb-media\\ai5vmd11.graph.json'

This happens after the run has already been successfully initialized. Full stack trace:

Training: 0it [00:00, ?it/s]
Training:   0%|                                                                                                                                                   | 0/1158 [00:00<?, ?it/s]
Epoch 0:   0%|                                                                                                                                                    | 0/1158 [00:00<?, ?it/s]
Traceback (most recent call last):
  File ".\train.py", line 77, in main
    metric = train(config)
  File ".\train.py", line 56, in train
    trainer.fit(model=model, datamodule=datamodule)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 57, in train
    return self.train_or_test()
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 550, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 718, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 485, in optimizer_step
    model_ref.optimizer_step(
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\core\lightning.py", line 1298, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\core\optimizer.py", line 286, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\core\optimizer.py", line 144, in __optimizer_step
    optimizer.step(closure=closure, *args, **kwargs)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\torch\autograd\grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\torch\optim\adam.py", line 66, in step
    loss = closure()
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 708, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 816, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 842, in backward
    result.closure_loss = self.trainer.accelerator_backend.backward(
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 109, in backward
    model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\pytorch_lightning\core\lightning.py", line 1162, in backward
    loss.backward(*args, **kwargs)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\wandb_torch.py", line 398, in backward_hook
    wandb.run.summary["graph_%i" % graph_idx] = graph
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\wandb_summary.py", line 57, in __setitem__
    self.update({key: val})
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\wandb_summary.py", line 79, in update
    self._update(record)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\wandb_summary.py", line 133, in _update
    self._update_callback(record)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\wandb_run.py", line 670, in _summary_update_callback
    self._backend.interface.publish_summary(summary_record)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\interface\interface.py", line 538, in publish_summary
    pb_summary_record = self._make_summary(summary_record)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\interface\interface.py", line 333, in _make_summary
    json_value = self._summary_encode(item.value, path_from_root)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\sdk\interface\interface.py", line 296, in _summary_encode
    data_types.val_to_json(
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\data_types.py", line 2948, in val_to_json
    val.bind_to_run(run, key, namespace)
  File "C:\Users\X\Anaconda3\envs\graphs\lib\site-packages\wandb\data_types.py", line 2510, in bind_to_run
    util.json_dump_safer(data, codecs.open(tmp_path, "w", encoding="utf-8"))
  File "C:\Users\X\Anaconda3\envs\graphs\lib\codecs.py", line 905, in open
    file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\X\\AppData\\Local\\Temp\\tmpgv0qht61wandb-media\\ai5vmd11.graph.json'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Wandb features used: PyTorch Lightning WandbLogger.

How to reproduce

Unfortunately this seems to happen completely at random. Most of the time runs behave as expected, but on average about 1 in 30 runs crashes, which is bothersome because it takes down my whole multirun. It has happened many times with different setups. There is no easy way to reproduce it.

Environment

  • OS: Windows 10
  • Environment: conda
  • Python Version: 3.8.6

Any idea what might cause it or how to prevent it?

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 7
  • Comments: 24 (7 by maintainers)

Most upvoted comments

I’m running into a similar issue when logging a wandb.Table object: my run sometimes crashes and sometimes doesn’t.

I’m using pytorch-lightning, hydra and wandb (all latest versions) on a SLURM cluster.

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpksnn7j_ywandb-media/2l3ay0ql.table.json'

My running theory is that, because I am using SLURM, the /tmp/ directory may be cleared by another user or process on the node, so the temporary folder where the table.json was supposed to be saved no longer exists.

I can reproduce this error by deleting the temporary media folder created by wandb (tmpksnn7j_ywandb-media in this case) while the code is running.
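
If that theory is right, one possible mitigation is to point Python’s tempfile module at a job-local scratch directory before wandb is imported, so its media files don’t live under the shared /tmp. This is only a sketch under that assumption (the scratch path is a placeholder, and I haven’t verified it against wandb’s internals):

import os
import pathlib
import tempfile

# Job-local scratch directory instead of the shared /tmp
# (placeholder path; adapt to your cluster's conventions).
scratch = pathlib.Path.cwd() / "wandb-tmp"
scratch.mkdir(parents=True, exist_ok=True)

# tempfile reads TMPDIR when it first resolves the default temp dir,
# so redirect it before anything (including wandb) creates temp files.
os.environ["TMPDIR"] = str(scratch)
tempfile.tempdir = str(scratch)  # force it even if gettempdir() already ran

import wandb  # imported only after the temp dir has been redirected

This would only help if the failure really comes from /tmp being cleaned up externally; it wouldn’t change anything for the multiprocessing cases discussed below.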

Hi - I am having a similar error when I use multiprocessing to spin up multiple agents for wandb sweeps. The error looks something like

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpqk8yfsrewandb-media/22jc9qcw.png'

and it halts my runs.

It would be awesome to have this fixed. I have a strong preference for using the multiprocessing module over other approaches to parallelization.
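
For context, this is roughly the kind of setup I mean (the sweep id, training function, and agent/process counts are placeholders):

import multiprocessing as mp
import wandb

def train():
    # placeholder training function executed by each sweep trial
    run = wandb.init()
    run.log({"loss": 0.0})
    run.finish()

def run_agent(sweep_id):
    # each process hosts its own sweep agent
    wandb.agent(sweep_id, function=train, count=10)

if __name__ == "__main__":
    sweep_id = "entity/project/sweep_id"  # placeholder
    procs = [mp.Process(target=run_agent, args=(sweep_id,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()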

@gautierdag My running theory is that, because I am using SLURM, the /tmp/ directory may be cleared by another user or process on the node, so the temporary folder where the table.json was supposed to be saved no longer exists.

I am also running into this issue when using hydra, wandb, and multiprocessing:

Traceback (most recent call last):
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/exing/repos/hydra_main.py", line 547, in <module>
    main()
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/site-packages/hydra/main.py", line 52, in decorated_main
    config_name=config_name,
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/site-packages/hydra/_internal/utils.py", line 378, in _run_hydra
    lambda: hydra.run(
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/site-packages/hydra/_internal/utils.py", line 381, in <lambda>
    overrides=args.overrides,
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 106, in run
    configure_logging=with_log_configuration,
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/exing/repos/hydra_main.py", line 259, in main
    num_batches=FLAGS.num_actor_batches,
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/weakref.py", line 648, in _exitfunc
    f()
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/weakref.py", line 572, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/exing/miniconda3/envs/ENV/lib/python3.7/shutil.py", line 483, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpvfdqzesiwandb-media'

An outline of hydra_main.py:

import hydra
import wandb

@hydra.main(config_path="./conf", config_name="conf_file")
def main(cfg):
    spawn_processes_fn()  # spawns worker processes with multiprocessing (definition omitted)

    exit()  # FileNotFoundError occurs

    if cfg.wandb:
        wandb.init(...)

if __name__ == "__main__":
    main()

Note that this example is not even calling wandb.init(), but wandb is imported at the top of the file. However, running:

import hydra

@hydra.main(config_path="./conf", config_name="conf_file")
def main(cfg):
    spawn_processes_fn()  # spawns worker processes with multiprocessing (definition omitted)

    exit()  # FileNotFoundError does NOT occur

    if cfg.wandb:
        import wandb  # moved import
        wandb.init(...)

if __name__ == "__main__":
    main()

does not cause this exception.

Hi there, sorry to hear about these issues. Does anyone on this thread have a minimal code sample that reproduces this? We’re a little stumped – it definitely looks like Python’s tempfile module (which is what we use under the hood) is failing here.

One suspicion: are any of these examples making use of multiprocessing? I think I’ve read that tempfiles can misbehave if used from a process other than the one that originally created the tempfile/tempdir.
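
To illustrate that suspicion, here is a minimal stdlib-only sketch (no wandb, Unix-only since it uses os.fork) of how a temporary directory created in a parent process can be removed by a child that inherited it, leaving the parent’s own cleanup to fail with the same FileNotFoundError at exit. Whether a given multiprocessing start method actually runs these finalizers in its workers depends on how the child exits, so treat this as an illustration of the failure mode rather than a confirmed root cause:

import os
import sys
import tempfile

# The parent creates a temporary directory; its cleanup is registered as a
# weakref finalizer that runs at interpreter shutdown (wandb's media tempdir
# works the same way).
tmpdir = tempfile.TemporaryDirectory(suffix="wandb-media")
print("created", tmpdir.name)

pid = os.fork()
if pid == 0:
    # Child: inherits the TemporaryDirectory object. Exiting through normal
    # interpreter shutdown runs the inherited finalizer and deletes the
    # directory on disk.
    sys.exit(0)

os.waitpid(pid, 0)
print("still exists in parent?", os.path.isdir(tmpdir.name))  # False
# When the parent exits, its own finalizer calls shutil.rmtree on the
# already-removed path and raises FileNotFoundError, as in the traceback above.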

I am having a similar issue – after a while the process fails with the following error message, although incomplete logs are still saved and synchronized:

Traceback (most recent call last):
  File "T:\Studies\SOIN_HUG\code\LSTM_model\train.py", line 138, in <module>
    run('train_configs.yml', weight_path='cache/weights')
  File "T:\Studies\SOIN_HUG\code\LSTM_model\train.py", line 129, in run
    wandb.log({"ROC" : wandb.plot.line(table, "FPR", "TPR",
  File "C:\Users\LevyfidelC\Anaconda3\lib\site-packages\wandb\sdk\wandb_run.py", line 370, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\LevyfidelC\Anaconda3\lib\site-packages\wandb\sdk\wandb_run.py", line 333, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\LevyfidelC\Anaconda3\lib\site-packages\wandb\sdk\wandb_run.py", line 1703, in log
    self._log(data=data, step=step, commit=commit)
  File "C:\Users\LevyfidelC\Anaconda3\lib\site-packages\wandb\sdk\wandb_run.py", line 1485, in _log
    self._partial_history_callback(data, step, commit)
  File "C:\Users\LevyfidelC\Anaconda3\lib\site-packages\wandb\sdk\wandb_run.py", line 1364, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "C:\Users\LevyfidelC\Anaconda3\lib\site-packages\wandb\sdk\interface\interface.py", line 568, in publish_partial_history
    data = history_dict_to_json(run, data, step=user_step, ignore_copy_err=True)
  File "C:\Users\LevyfidelC\Anaconda3\lib\site-packages\wandb\sdk\data_types\utils.py", line 52, in history_dict_to_json
    payload[key] = val_to_json(
  File "C:\Users\LevyfidelC\Anaconda3\lib\site-packages\wandb\sdk\data_types\utils.py", line 160, in val_to_json
    val.bind_to_run(run, key, namespace)
  File "C:\Users\LevyfidelC\Anaconda3\lib\site-packages\wandb\data_types.py", line 527, in bind_to_run
    with codecs.open(tmp_path, "w", encoding="utf-8") as fp:
  File "C:\Users\LevyfidelC\Anaconda3\lib\codecs.py", line 905, in open
    file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\LEVYFI~1\\AppData\\Local\\Temp\\tmplhrdta6qwandb-media\\tfjgti4e.table.json'
wandb: Waiting for W&B process to finish... (failed 1). Press Ctrl-C to abort syncing.

I am not using multiprocessing, and I placed import wandb right before calling wandb.init() as suggested, but it didn’t prevent the error from happening.


Same issue here. However, I am not using multiprocessing. The problem indeed occurs when logging a table: for some runs it works, for others it doesn’t, seemingly at random. I am using the most recent wandb version.
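
For reference, this is roughly the kind of call that intermittently fails for me (the project name and data are placeholders):

import wandb

run = wandb.init(project="example-project")  # placeholder project
table = wandb.Table(columns=["step", "value"], data=[[0, 0.1], [1, 0.2]])
run.log({"my_table": table})  # this log call is what intermittently raises FileNotFoundError
run.finish()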