wandb: [CLI]: wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=564)
Describe the bug
Hi, I’m training a yolov5 model in colab using weights and biases for the checkpoint feature. However, sometimes I’m getting the error pasted below. This can happen at the start of the training or in middle of the training. Last time my training got killed after 130 epochs due to this error.
!python train.py --img 1280 --rect --batch -1 --epochs 200 --data /content/drive-mnt/dataset --weights yolov5s.pt --save-period 5 --upload_dataset --cache
0/199 7.41G 0.1254 0.101 0.03192 33 1280: 0% 0/106 [00:02<?, ?it/s]wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=564)
A second issue I notice is that for some reason I cannot use --resume with yolov5 and w&b path anymore
!python train.py --resume wandb-artifact://redacted/YOLOv5/3laxf9ci
gives me
AssertionError: File not found: wandb-artifact://redacted/YOLOv5/3laxf9ci
I double checked and the path does exist in my w&b account.
Furthermore, for both issues I’m logged in correctly.
The wandb version I’m using is wandb==0.13.6
Any idea what might be causing this?
Additional Files
No response
Environment
WandB version:0.13.6
OS:Google Colab (ubuntu?)
Python version:3.8.16
Versions of relevant libraries:
Additional Context
No response
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 2
- Comments: 26 (9 by maintainers)
I can confirm this bug, I get the same error message, typically happens after hours of training