deeplearning4j: Using Workspaces and disabling GC calls ends in java.lang.OutOfMemoryError
Issue Description
When I switch my neural net from relying on JVM GC to using workspaces, by adding .trainingWorkspaceMode(WorkspaceMode.SEPARATE) to my model configuration and disabling periodic GC calls with Nd4j.getMemoryManager().togglePeriodicGc(false);, memory consumption explodes during training, and after a few epochs I get:
java.lang.OutOfMemoryError: Cannot allocate new FloatPointer(1): totalBytes = -3051750267, physicalBytes = 6G
at org.bytedeco.javacpp.FloatPointer.<init>(FloatPointer.java:76)
at org.bytedeco.javacpp.FloatPointer.<init>(FloatPointer.java:41)
at org.nd4j.linalg.jcublas.blas.JcublasLevel3.sgemm(JcublasLevel3.java:107)
at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm(BaseLevel3.java:57)
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli(BaseNDArray.java:3011)
at org.nd4j.linalg.api.ndarray.BaseNDArray.mmul(BaseNDArray.java:2812)
at org.deeplearning4j.nn.layers.BaseLayer.preOutput(BaseLayer.java:317)
at org.deeplearning4j.nn.layers.BaseLayer.activate(BaseLayer.java:328)
at org.deeplearning4j.nn.layers.recurrent.RnnOutputLayer.output(RnnOutputLayer.java:149)
at org.deeplearning4j.nn.layers.BaseOutputLayer.activate(BaseOutputLayer.java:189)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.activationFromPrevLayer(MultiLayerNetwork.java:789)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.feedForwardToLayer(MultiLayerNetwork.java:929)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.feedForward(MultiLayerNetwork.java:870)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.feedForward(MultiLayerNetwork.java:861)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.silentOutput(MultiLayerNetwork.java:1906)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.silentOutput(MultiLayerNetwork.java:1936)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.doEvaluation(MultiLayerNetwork.java:2892)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at com.intellij.junit4.JUnit45ClassesRequestBuilder$1$1$2$2.runChild(JUnit45ClassesRequestBuilder.java:82)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes = 6G > maxPhysicalBytes = 6G
at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:576)
at org.bytedeco.javacpp.Pointer.init(Pointer.java:121)
at org.bytedeco.javacpp.FloatPointer.allocateArray(Native Method)
at org.bytedeco.javacpp.FloatPointer.<init>(FloatPointer.java:68)
... 38 more
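As an aside, the limit in the "Caused by" line is JavaCPP's off-heap cap (maxPhysicalBytes), not the JVM heap. Below is a minimal sketch of how that cap is configured, assuming JavaCPP's standard system properties; raising it only delays the error if memory is genuinely leaking rather than addressing the growth itself:

// JavaCPP reads these properties when its Pointer class is loaded, so in
// practice they are passed on the JVM command line, e.g.:
//   -Dorg.bytedeco.javacpp.maxbytes=8G
//   -Dorg.bytedeco.javacpp.maxphysicalbytes=12G
// Programmatic equivalent (assumption: must run before any ND4J/JavaCPP use):
System.setProperty("org.bytedeco.javacpp.maxbytes", "8G");
System.setProperty("org.bytedeco.javacpp.maxphysicalbytes", "12G");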
With normal JVM GC, only a fraction of the memory is used, and training is even a little faster, which should not be the case. I also tried disabling evaluation after each epoch, but the issue is the same. Here is the Java code of the neural net config and the iterator I am using:
https://gist.github.com/Tschigger/e451fdc68b13d19157478b7b4084ec62
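For quick reference, a minimal sketch of the configuration change described above (layer setup omitted, see the linked gist; method names per the 0.9.1 builder API, and the inference-mode line is an assumption rather than something confirmed by the gist):

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.WorkspaceMode;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.factory.Nd4j;

public class WorkspaceSetupSketch {
    public static void main(String[] args) {
        // Disable ND4J's periodic System.gc() calls, as described above.
        Nd4j.getMemoryManager().togglePeriodicGc(false);

        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .trainingWorkspaceMode(WorkspaceMode.SEPARATE)  // workspaces for training
                .inferenceWorkspaceMode(WorkspaceMode.SEPARATE) // assumption, not confirmed by the gist
                .list()
                // ... layers as in the linked gist ...
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
    }
}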
Version Information
Ubuntu; system RAM: 16 GB; GPU RAM: 8 GB (GTX 1070)
- deeplearning4j-cuda-8.0, version 0.9.1
- nd4j-cuda-8.0-platform, version 0.9.1
- datavec-api, version 0.9.1
Contributing
If you’d like to help us fix the issue by contributing some code, but would like guidance or help in doing so, please mention it!
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 30 (16 by maintainers)
It doesn't matter if I set the training or inference workspaces to SINGLE or SEPARATE, or if I comment the parameters out of the network-building code completely. The result is always [training: NONE; inference: SEPARATE], no matter what I do or set.
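For illustration, a sketch of how the modes can be set and read back (getter/setter names assumed from the 0.9.1 MultiLayerConfiguration API; the exact check used by the commenter is not shown in the issue):

// Assumption: 'net' is the initialized MultiLayerNetwork from the gist.
MultiLayerConfiguration conf = net.getLayerWiseConfigurations();
conf.setTrainingWorkspaceMode(WorkspaceMode.SINGLE);
conf.setInferenceWorkspaceMode(WorkspaceMode.SEPARATE);

// Read back what the network reports; per the comment above, this
// always prints [training: NONE; inference: SEPARATE] regardless:
System.out.println("[training: " + conf.getTrainingWorkspaceMode()
        + "; inference: " + conf.getInferenceWorkspaceMode() + "]");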