wandb: GPU memory leak while running sweeps
wandb --version && python --version && uname
Weights and Biases version: 0.9.7 Python version: 3.7.9 Operating System: Ubuntu 18.04LTS
Description
I’m running sweeps, and I notice that every so often one of the GPUs doesn’t reclaim all its memory after a training job goes away. It ends up in this horrible CUDA-bug state where nvidia-smi reports that the memory is used in the top half, but in the bottom half doesn’t report any processes that owns that memory. I can only reclaim the memory by rebooting the machine. (I’ve read that sometimes nvidia-smi -r will fix this, but it’s never let me reset the GPU that way I think because X-windows is running on it.)
What I Did
This is not a great bug report, because I don’t know how to repro it. I’m not even sure it’s anything to do with wandb, or just some bug between CUDA & pytorch or something. But I’ve seen it three or four times now, and only when running wandb sweeps. I’ve mostly been using hyperband early termination with my sweeps. And I sometimes will kill jobs manually from the wandb web UI. So I suspect it’s maybe got something to do with the way the agent kills the python process that’s using the GPU - maybe it’s not cleaning up properly.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 6
- Comments: 32 (9 by maintainers)
This doesn’t help the “memory leak” that occurs when I Ctrl+C to terminate the programs though.
I see, thanks for following up. We’re looking into fixing this issue.
hello, i meet the same error while using sweep module. The program can not free the GPU memory by itself, i have to clean up the GPU memory after that. And seems i can not kill the program using ctrl+c, it will print a wandb log info “ctrl+c pressed” and run as normal. If I press ctrl+c twice, then the program is killed with leaked GPU memory. Any solutions on this issue? Thanks a lot.
Issue-Label Bot is automatically applying the label
bugto this issue, with a confidence of 0.93. Please mark this comment with 👍 or 👎 to give our bot feedback!Links: app homepage, dashboard and code for this bot.
I’m having the same issue https://github.com/huggingface/transformers/issues/10885