TensorFlow.NET: Error run model on GPU: Out of GPU memory
System information
- OS: Windows 10 Pro
- TensorFlow dll version: 2.0
- TensorFlow.NET version: 0.11.1.0
- GPU: NVIDIA GeForce RTX 2060 6 GB
- CUDA version: 10.1.243
- cuDNN version: 7.6.4.38
Describe the problem
I am using TensorFlow.NET to run a saved model, and it fails when run on the GPU. Training uses around 1.5 GB of GPU memory, so prediction should need roughly the same, but it does not: on the CPU the prediction runs fine, while on the GPU it fails with the error message below. I configure the session for GPU usage with the code sample below. Watching nvidia-smi, I can see that almost all of the GPU memory (around 5.8 GB) is allocated when the run method is executed, so it looks like the per-process memory fraction is not being applied. Is there an error in my session configuration code, or is this an internal bug in TensorFlow.NET?
Code sample
// Build a session configuration that limits GPU memory usage
SessionOptions sessionOptions = new SessionOptions();
ConfigProto config = new ConfigProto();
config.LogDevicePlacement = true;  // log which device each op is placed on
config.AllowSoftPlacement = true;  // fall back to CPU if an op has no GPU kernel
GPUOptions gpuOptions = new GPUOptions();
gpuOptions.AllowGrowth = true;                 // allocate GPU memory on demand
gpuOptions.PerProcessGpuMemoryFraction = 0.5;  // cap allocation at 50% of GPU memory
config.DeviceCount.Add("GPU", 1);
config.GpuOptions = gpuOptions;
sessionOptions.SetConfig(config);
_sessionPredict = new Session(_graph, sessionOptions);
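For completeness, a possible workaround to try when the ConfigProto settings are not honored is the TF_FORCE_GPU_ALLOW_GROWTH environment variable; this is a sketch that assumes the native TensorFlow 2.0 runtime loaded by TensorFlow.NET reads it the same way the official Python builds do:

```shell
# Hypothetical workaround: ask the TF allocator to grow GPU memory on demand
# instead of reserving (almost) all of it up front.
# Must be set in the environment before the process starts.
export TF_FORCE_GPU_ALLOW_GROWTH=true
```

On Windows the equivalent is `set TF_FORCE_GPU_ALLOW_GROWTH=true` in cmd (or a system environment variable), since the variable has to be visible to the process that loads the TensorFlow DLL.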
Error message
2019-10-28 12:37:34.537578: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-10-28 12:37:34.540489: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node Conv1_1/convConv}}]]
[[LossScope/outputPrediction/_43]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node Conv1_1/convConv}}]]
0 successful operations.
0 derived errors ignored.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 17 (7 by maintainers)
That is exactly what I do, and it is the same as @quintakFR's code in the first post of this thread. Anyway, it doesn't matter to me that it allocates all the memory; the important part is that the convolution works after downgrading to cuDNN 7.4.