djl: memory leak and duration increase during training

Description

During training on GPU I experience

  • 4 MiB memory leak on GPU per epoch (looks constant)
  • an increase in epoch duration of about 1 min per epoch (looks linear)

Expected Behavior

no memory leak, roughly constant duration per epoch

How to Reproduce?

I set up a toy app based on the DJL MNIST example to reproduce the problem I experience:

git clone https://github.com/enpasos/reproducebug1.git
cd reproducebug1
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

Environment Info

  • GPU: NVIDIA GeForce RTX 3090
  • CPU: AMD Ryzen 9 3950X 16-Core Processor
  • RAM: 64 GB
  • OS: Windows 11 Pro, Version 22H2, OS build 22623.1020
  • GPU Driver: 522.25
  • CUDA SDK: 11.6.2
  • CUDNN: cudnn-windows-x86_64-8.5.0.96_cuda11
  • Java: Corretto-17.0.3.6.1
  • DJL: 0.21.0-SNAPSHOT (05.12.2022)
  • PYTORCH: 1.12.1

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 27 (27 by maintainers)

Most upvoted comments

@enpasos Just FYI, an attempt to solve the memory leak and efficiency problem has been made in this PR: https://github.com/deepjavalibrary/djl/pull/2232

Hi @enpasos, it sounds like your idea builds on the existing resource management and differs from what we have tried before, for example the hybrid GC that @zachgk mentioned. If you are sure about this, then yes, your PR and contribution are very much appreciated. On the other hand, resource management is something we have to be very cautious about, and it is also an area where we have spent a lot of design effort. So it is important to make sure you don’t waste time repeating what we have already tried.

If you already have an implementation at hand, you are always welcome to open a PR, which may help make the discussion more concrete. Still, the key issue is to make sure your design is indeed something we didn’t consider before, and that it satisfies the constraints we took into account when designing the resource management in the first place. This probably needs more specific details of the design and further discussion, where, again, a tentative PR or some sort of design document may be helpful. I’m not an expert in this part; @lanking520, @zachgk and @frankfliu have a greater say in this area.

Hi @enpasos, thanks a lot for pointing out this problem and coming up with a solution and the POC. Indeed, we currently manage native resources through the NDManager hierarchy, and it is a burden on the implementation to worry about the inequivalence between c1=a.add(b) and c2=b.add(a), as well as about which NDManager to attach to. Usually, to make a change at the level of resource management, we need to estimate the effort, the compatibility and the potential impact. It is great to know that the proposed dynamic-proxy implementation does not require much effort. It’s also great that you have implemented this POC. Let me bring it to the team meeting so we can look into this nice idea.
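
The inequivalence between a.add(b) and b.add(a) mentioned above can be sketched with a toy model in plain Java. This is only an illustration under stated assumptions: ToyManager and ToyArray are hypothetical stand-ins, not DJL classes, and the rule that a result is attached to the manager of the receiver (the left operand) is an assumption of this sketch. It shows how results attached to a long-lived manager can pile up across epochs, while results attached to a per-epoch manager are released when that manager closes.

```java
import java.util.ArrayList;
import java.util.List;

public class ManagerOwnershipSketch {

    /** Hypothetical stand-in for a hierarchical resource manager (not DJL's NDManager). */
    static final class ToyManager implements AutoCloseable {
        final String name;
        private final List<ToyArray> owned = new ArrayList<>();

        ToyManager(String name) {
            this.name = name;
        }

        /** Creates an array owned by this manager. */
        ToyArray create(double value) {
            ToyArray array = new ToyArray(this, value);
            owned.add(array);
            return array;
        }

        /** Number of arrays this manager still holds. */
        int liveCount() {
            return owned.size();
        }

        /** Releases every array owned by this manager. */
        @Override
        public void close() {
            for (ToyArray array : owned) {
                array.released = true;
            }
            owned.clear();
        }
    }

    /** Hypothetical stand-in for a native-backed array. */
    static final class ToyArray {
        final ToyManager manager;
        final double value;
        boolean released;

        ToyArray(ToyManager manager, double value) {
            this.manager = manager;
            this.value = value;
        }

        /** Assumption of this sketch: the result is attached to the receiver's manager. */
        ToyArray add(ToyArray other) {
            return manager.create(value + other.value);
        }
    }

    public static void main(String[] args) {
        ToyManager longLived = new ToyManager("model");
        ToyArray a = longLived.create(1.0);

        for (int epoch = 0; epoch < 3; epoch++) {
            try (ToyManager perEpoch = new ToyManager("epoch-" + epoch)) {
                ToyArray b = perEpoch.create(2.0);
                ToyArray c1 = a.add(b); // owned by longLived: survives the epoch
                ToyArray c2 = b.add(a); // owned by perEpoch: released on close
            }
        }
        // One extra array stays in longLived per epoch, on top of 'a' itself.
        System.out.println("arrays held by long-lived manager: " + longLived.liveCount());
    }
}
```

In this toy model both calls compute the same value, but only the ownership of the result differs, which is why the operand order matters for resource lifetime.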

Could you please help here, @KexinFeng?