wandb: [CLI]: wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=564)

Describe the bug

Hi, I’m training a YOLOv5 model in Colab, using Weights & Biases for the checkpoint feature. However, I sometimes get the error pasted below. It can happen at the start of training or in the middle of it. Last time, my training was killed after 130 epochs because of this error.

!python train.py --img 1280 --rect --batch -1 --epochs 200 --data /content/drive-mnt/dataset --weights yolov5s.pt --save-period 5 --upload_dataset --cache
      0/199      7.41G     0.1254      0.101    0.03192         33       1280:   0% 0/106 [00:02<?, ?it/s]wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=564)

A second issue I noticed is that, for some reason, I can no longer use --resume with YOLOv5 and a W&B artifact path:

!python train.py --resume wandb-artifact://redacted/YOLOv5/3laxf9ci

gives me

AssertionError: File not found: wandb-artifact://redacted/YOLOv5/3laxf9ci

I double-checked, and the path does exist in my W&B account. For both issues, I’m logged in correctly. The wandb version I’m using is wandb==0.13.6. Any idea what might be causing this?

Additional Files

No response

Environment

WandB version: 0.13.6

OS: Google Colab (Ubuntu?)

Python version: 3.8.16

Versions of relevant libraries:

Additional Context

No response

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 2
  • Comments: 26 (9 by maintainers)

Most upvoted comments

I can confirm this bug. I get the same error message; it typically happens after hours of training.

Traceback (most recent call last):
  File "inpainting/train_autoregressive.py", line 337, in <module>
    load_ckpt=args.load_ckpt
  File "/home/lennartv/.conda/envs/neo38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/lennartv/.conda/envs/neo38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/lennartv/.conda/envs/neo38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL
wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=6029)
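
In the traceback above, the SIGKILL comes first and the wandb message follows, which suggests (as an assumption, not something confirmed in this thread) that the metric sampler was simply polling a process that had already been killed. The sketch below is not wandb's or psutil's code. It is a stdlib-only, POSIX-only illustration of the same sequence: a child gets SIGKILLed, its exit code comes back as the negative signal number, and a later lookup of its PID fails. The helper names `run_and_kill` and `pid_exists` are made up for this example.

```python
import os
import signal
import subprocess
import sys


def run_and_kill():
    """Start a sleeping child process, SIGKILL it, and return (pid, returncode)."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Stand-in for whatever killed the real training process
    # (for example the kernel OOM killer; that cause is an assumption).
    child.send_signal(signal.SIGKILL)
    child.wait()  # reap the child so its PID is released
    return child.pid, child.returncode


def pid_exists(pid):
    """Return True if a process with this PID can currently be signalled."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False  # stdlib analogue of psutil.NoSuchProcess
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


pid, returncode = run_and_kill()
# On POSIX, a negative return code means "terminated by signal -returncode".
killed_by = signal.Signals(-returncode).name if returncode < 0 else None
print(f"child {pid} exited with code {returncode}, killed by {killed_by}")
print(f"pid still exists: {pid_exists(pid)}")
```

If this is what happened, the wandb error is a symptom rather than the cause, and the thing to look for is whatever sent the SIGKILL (on Linux, `dmesg` often shows OOM-killer activity).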