DALI: Possible memory leak with multi-GPU training?

Version

1.21

Describe the bug.

I used the code from the tutorial to train ImageNet (https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py). I have six 1080 Ti GPUs.

However, the memory consumption of the 6th GPU was always higher than the others and kept increasing during training, eventually throwing an OOM exception in the middle of training.

For example, here is my nvidia-smi output; the memory consumption of the 6th GPU is larger compared to the others.

[nvidia-smi screenshot]

Any idea?

Minimum reproducible example

Ref https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py

Relevant log output

No response

Other/Misc.

No response

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 23 (4 by maintainers)

Most upvoted comments

That sounds like a good thing to try.

@twmht, that is not expected from the DALI side. Could you run the DALI pipeline alone, without the training, and see if the memory growth still occurs, to rule out the DL framework itself?
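For reference, a minimal sketch of what iterating the pipeline in isolation could look like. The data path, batch size, and augmentation chain below are assumptions on my side, not the exact tutorial pipeline; adapt them to match what you are actually running and watch nvidia-smi while the loop runs:

```python
# Sketch: run a DALI pipeline with no model, optimizer, or training loop,
# to see whether GPU memory still grows when only DALI is involved.
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def


@pipeline_def
def image_pipeline(data_dir, shard_id, num_shards):
    jpegs, labels = fn.readers.file(file_root=data_dir,
                                    shard_id=shard_id,
                                    num_shards=num_shards,
                                    random_shuffle=True,
                                    name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")  # hybrid CPU/GPU decoding
    images = fn.random_resized_crop(images, size=[224, 224])
    images = fn.crop_mirror_normalize(images,
                                      dtype=types.FLOAT,
                                      output_layout="CHW",
                                      mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                      std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, labels


# device_id=5 targets the 6th GPU, the one showing the growth.
pipe = image_pipeline(data_dir="/data/imagenet/train",  # hypothetical path
                      shard_id=5, num_shards=6,
                      batch_size=64, num_threads=4, device_id=5)
pipe.build()

for it in range(10_000):
    images, labels = pipe.run()   # DALI only, nothing else touches the GPU
    if it % 500 == 0:
        print(f"iteration {it}")  # check nvidia-smi alongside this loop
```

If memory stays flat here but grows in the full training script, the growth is more likely coming from the framework side than from DALI.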

Hi @twmht,

You can find all changes introduced in the recent DALI releases here; we fixed at least one detected memory leak. Regarding the uneven memory consumption: DALI uses memory pools, and when the memory usage on a given GPU crosses a threshold, another chunk is allocated. That is why one GPU can end up using more than the others (the memory required by its randomly formed batches of samples can simply be higher). You can also consider reducing the batch size to lower the memory consumption.
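To tell a genuine leak apart from a one-time pool growth, it can help to log per-GPU memory over time rather than eyeballing nvidia-smi. A small sketch using pynvml (an extra dependency I am assuming here, not something the tutorial uses):

```python
# Sketch: log per-GPU memory use periodically during training.
# Call log_gpu_memory(step) every N iterations and compare the trend on
# the 6th GPU (index 5) with the other GPUs.
import pynvml

pynvml.nvmlInit()
_handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
            for i in range(pynvml.nvmlDeviceGetCount())]


def log_gpu_memory(step):
    used_mib = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 2**20 for h in _handles]
    print(f"step {step}: " + ", ".join(f"gpu{i}={m:.0f}MiB"
                                       for i, m in enumerate(used_mib)))
```

A value that climbs steadily on one GPU points towards a leak; a single jump to a higher plateau that then stays put is consistent with the memory pool growing once for a larger batch, as described above.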