wandb: GPU memory leak while running sweeps

wandb --version && python --version && uname

Weights and Biases version: 0.9.7
Python version: 3.7.9
Operating System: Ubuntu 18.04 LTS

Description

I’m running sweeps, and I notice that every so often one of the GPUs doesn’t reclaim all its memory after a training job goes away. It ends up in that horrible CUDA-bug state where the top half of nvidia-smi reports the memory as used, but the bottom half doesn’t list any process that owns it. I can only reclaim the memory by rebooting the machine. (I’ve read that nvidia-smi -r will sometimes fix this, but it has never let me reset the GPU that way, I think because X Windows is running on it.)

What I Did

This is not a great bug report, because I don’t know how to reproduce it. I’m not even sure it has anything to do with wandb; it could be some bug between CUDA and PyTorch. But I’ve seen it three or four times now, and only when running wandb sweeps. I’ve mostly been using Hyperband early termination with my sweeps, and I sometimes kill jobs manually from the wandb web UI. So I suspect it has something to do with the way the agent kills the Python process that’s using the GPU - maybe it isn’t cleaning up properly.
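If that guess is right and the agent stops runs by signalling the Python process, one thing worth trying (purely an assumption on my part, not anything the wandb docs prescribe) is making sure the process exits through the normal interpreter shutdown path, so cleanup code actually gets a chance to run. A minimal sketch; the choice of SIGTERM is an assumption, and nothing can help if the process is SIGKILLed:

import atexit
import signal
import sys

import torch

def _release_gpu_memory():
    # Runs at interpreter shutdown: return cached CUDA blocks to the driver.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

atexit.register(_release_gpu_memory)

def _exit_gracefully(signum, frame):
    # Turn a termination signal into a normal exit so atexit handlers
    # and context-manager __exit__ methods still run.
    sys.exit(0)

signal.signal(signal.SIGTERM, _exit_gracefully)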

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 6
  • Comments: 32 (9 by maintainers)

Most upvoted comments

Although this issue is closed, I thought I would offer my two cents, since I’ve been wrestling with it for the past few days. It looks like freeing up the resources at the end of the training function does the trick.

import torch
import wandb

# main training loop
def train(config=None):
  with wandb.init(config=config):
    config = wandb.config

    ...  # build `model` and run the training loop here

  # cleanup: drop the last reference to the model and release cached GPU memory
  del model
  torch.cuda.empty_cache()


# running the sweep (sweep_id comes from wandb.sweep(...) or the web UI)
wandb.agent(sweep_id, train)
wandb.finish()

That way you can still use the Python approach without having to define a .yaml file.
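For anyone who wants the whole thing in Python, here is a minimal sketch of that approach; the project name and the parameter ranges below are made up for illustration:

import wandb

# sweep definition as a plain dict instead of a sweep.yaml
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [16, 32, 64]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="my-project")

# `train` is the function from the snippet above; count limits the number of runs
wandb.agent(sweep_id, train, count=20)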

This doesn’t help with the “memory leak” that occurs when I press Ctrl+C to terminate the program, though.
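One workaround that might cover the Ctrl+C case (my own assumption, not a confirmed fix) is to move the cleanup into a finally block, so it runs whether the run finishes normally or KeyboardInterrupt lands mid-training:

import torch
import wandb

def train(config=None):
    with wandb.init(config=config):
        config = wandb.config
        model = None
        try:
            ...  # build `model` and run the training loop here
        finally:
            # Executes on normal completion and when Ctrl+C raises KeyboardInterrupt.
            del model
            torch.cuda.empty_cache()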

I see, thanks for following up. We’re looking into fixing this issue.

Hello, I’m hitting the same error while using the sweep module. The program cannot free the GPU memory by itself; I have to clean it up afterwards. It also seems I cannot kill the program with Ctrl+C: it prints a wandb log message saying “ctrl+c pressed” and keeps running as normal. If I press Ctrl+C twice, the program is killed but the GPU memory stays leaked. Any solutions for this issue? Thanks a lot.
