djl: memory leak and duration increase during training
Description
During training on GPU I observe
- a GPU memory leak of about 4 MiB per epoch (looks constant per epoch)
- an epoch-duration increase of about 1 min per epoch (looks linear)
Expected Behavior
no memory leak, roughly constant duration per epoch
How to Reproduce?
I set up a toy app based on the DJL MNIST example to reproduce the problem:
git clone https://github.com/enpasos/reproducebug1.git
cd reproducebug1
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar
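For context, the toy app is essentially a standard DJL MNIST training loop run for many epochs. The sketch below is only an illustration of what such a loop looks like, assuming the usual DJL training API (class names, layer sizes, batch size, and epoch count are illustrative, not copied from the repo); on a CUDA machine DJL picks the GPU by default, and the leak and slowdown are observed per epoch while this runs:

```java
import ai.djl.Model;
import ai.djl.basicdataset.cv.classification.Mnist;
import ai.djl.basicmodelzoo.basic.Mlp;
import ai.djl.ndarray.types.Shape;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.dataset.RandomAccessDataset;
import ai.djl.training.loss.Loss;
import ai.djl.translate.TranslateException;

import java.io.IOException;

public class App {
    public static void main(String[] args) throws IOException, TranslateException {
        // MNIST dataset, downloaded on first run; batch size 64 is an arbitrary choice here.
        RandomAccessDataset trainSet = Mnist.builder().setSampling(64, true).build();
        trainSet.prepare();

        try (Model model = Model.newInstance("mlp")) {
            // Small MLP from the DJL model zoo; layer sizes are illustrative.
            model.setBlock(new Mlp(28 * 28, 10, new int[] {128, 64}));

            DefaultTrainingConfig config =
                    new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss());

            try (Trainer trainer = model.newTrainer(config)) {
                trainer.initialize(new Shape(1, 28 * 28));
                // Run many epochs; GPU memory usage and wall-clock time per epoch
                // are what the bug report measures.
                EasyTrain.fit(trainer, 50, trainSet, null);
            }
        }
    }
}
```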
Environment Info
- GPU: NVIDIA GeForce RTX 3090
- CPU: AMD Ryzen 9 3950X 16-Core Processor
- RAM: 64 GB
- OS: Windows 11 Pro, Version 22H2, OS build 22623.1020
- GPU Driver: 522.25
- CUDA SDK: 11.6.2
- CUDNN: cudnn-windows-x86_64-8.5.0.96_cuda11
- Java: Corretto-17.0.3.6.1
- DJL: 0.21.0-SNAPSHOT (05.12.2022)
- PYTORCH: 1.12.1
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 27 (27 by maintainers)
@enpasos Just FYI, an attempt to fix the memory leak and efficiency problem has been made in this PR: https://github.com/deepjavalibrary/djl/pull/2232
Hi @enpasos, it sounds like your idea builds somewhat on top of the existing resource management and differs from what we have tried before, for example the hybrid GC @zachgk mentioned. If you are sure about this, then yes, your PR and contribution are very much appreciated. On the other hand, resource management is something we have to be very cautious about, and it is an area where we spent a lot of design effort. So it is important to make sure you don't waste time repeating what we have already tried.
If you already have an implementation at hand, you are always welcome to open a PR, which may help make the discussion more concrete and clear. Still, the key issue is to make sure your design is indeed something we did not consider before and that it satisfies the constraints we weighed when designing the resource management in the first place. This probably needs more specific details of the design and further discussion, where, again, a tentative PR or some sort of design document may be helpful. I'm not an expert in this part; @lanking520 @zachgk @frankfliu have a greater say in this area.
Hi @enpasos, thanks a lot for pointing out this problem and coming up with the solution and the POC. Indeed, we currently manage the native resources via the NDManager hierarchy, and it is a burden on implementations to worry about the inequivalence between `c1 = a.add(b)` and `c2 = b.add(a)`, as well as the question of which NDManager to attach a result to. Usually, before making a change at the level of resource management, we need to estimate the implementation effort, the compatibility, and the potential impact. It is great to know that the proposed dynamic-proxy implementation requires little effort, and great that you have already built a POC. Let me bring it to the team meeting to look into this nice idea. Could you please help here @KexinFeng
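To make the `c1 = a.add(b)` vs `c2 = b.add(a)` inequivalence concrete, here is a minimal sketch (my own illustration, not code from the issue or the POC). To my understanding of the current engines, the result of an operation is attached to the NDManager of the array the method is invoked on, so two mathematically identical expressions can end up with different owners and lifetimes:

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;

public class AttachmentDemo {
    public static void main(String[] args) {
        try (NDManager outer = NDManager.newBaseManager();
             NDManager inner = outer.newSubManager()) {

            NDArray a = outer.create(new float[] {1f, 2f});
            NDArray b = inner.create(new float[] {3f, 4f});

            // The result inherits the manager of the array the op is called on,
            // so c1 and c2 are equal in value but differ in ownership/lifetime.
            NDArray c1 = a.add(b); // owned by "outer", lives until outer closes
            NDArray c2 = b.add(a); // owned by "inner", freed when inner closes

            System.out.println(c1.getManager() == outer); // expected: true
            System.out.println(c2.getManager() == inner); // expected: true
        }
    }
}
```

This is exactly the kind of bookkeeping the dynamic-proxy proposal aims to take off the implementer's shoulders.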