tensorflow: GPU becomes unavailable after computer wakes up
I noticed many people have issues with the GPU being unavailable with the following message (e.g., issue 394):
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
Some suggested `sudo apt-get install nvidia-modprobe`, but it does not work for everyone, including me. My GPU works until I put the computer to sleep/suspend; after waking the computer up I always get the message above and the GPU (GTX 1070) is no longer available to the code running in nvidia-docker (only the CPU is used). I also noticed that if I exit the docker container before suspending and restart it after waking up, the GPU is still available in docker. So the problem happens when I suspend the computer while the IPython notebook session is up and running.
I am using nvidia-docker
nvidia-docker run -it -p 8888:8888 -v /*..../Data/docker:/docker --name TensorFlow gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash
`nvidia-smi` and `nvidia-debugdump -l` both show that the GPU is installed and the driver is up to date, both inside docker and on the host.
When I run `nvidia-smi` inside docker, the output is:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:01:00.0      On |                  N/A |
|  0%   41C    P0    39W / 180W |    450MiB /  8105MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: ca234sff235
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: ca234sff235
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.57.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016
GCC version: gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.57.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.57.0
Software specs:
- OS: Ubuntu 16.04 LTS, 64-bit
- GPU driver: nvidia 367.57
- CUDA: 7.5
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 6
- Comments: 33 (7 by maintainers)
Hello all! I was running into the same issue and finally managed to make it work after resume without rebooting the computer.
You just need to `rmmod` the `nvidia` module (and all dependent modules, in my case `nvidia_uvm`) and then `modprobe` them again (in reverse order). Hope this helps 😃
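The code block with the exact commands did not survive in this copy of the thread; based on the description (and assuming, as in the comment, that only `nvidia_uvm` depends on `nvidia`), it would presumably look like:

```bash
# Unload the dependent module first, then the main driver module,
# then reload them in reverse order.
sudo rmmod nvidia_uvm
sudo rmmod nvidia
sudo modprobe nvidia
sudo modprobe nvidia_uvm
```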
Same issue for me (Linux Mint 18.3, Nvidia Quadro M1200, driver version 384.130, Cuda 9.0 and Keras/TF).
This does not work for me; I get stuck at `sudo rmmod nvidia_drm`: it prevents the other modules (`nvidia_modeset` and `nvidia`) from being removed.
Thanks @pierrekilly ! That worked for me. I was able to get around the issue by running the following script after my devbox wakes up.
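The script itself was not preserved here; a minimal sketch of what such a post-wake script might look like (module names assumed to match the earlier comment) is:

```bash
#!/usr/bin/env bash
# Hypothetical wake-up helper: reload the NVIDIA kernel modules after resume.
# Stop anything still using the GPU (Jupyter kernels, containers) before running it.
set -e
sudo rmmod nvidia_uvm nvidia       # rmmod accepts several modules at once
sudo modprobe -a nvidia nvidia_uvm # -a loads all listed modules, in reverse order
```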
Is there a solution to this problem for people who only have one GPU and it is being used as a display?
@harpone Try like this: you have to remove the modules that use the `nvidia` one before being able to remove it.

Is it recommended to kill Xorg on every suspend?
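To see which modules are keeping the `nvidia` module loaded (and therefore have to be removed first), a quick check is the following; the module names mentioned in the comment are only the typical ones:

```bash
# The "Used by" column shows what must be removed before nvidia itself;
# typically nvidia_uvm, nvidia_drm and nvidia_modeset show up there.
lsmod | grep nvidia
```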
@zheng-xq, this is the test that I ran. My interpretation is that the issue is caused by TensorFlow:
`import tensorflow as tf`
`tf.test.gpu_device_name()` – OK
`tf.test.gpu_device_name()` – FAILED

Command output details:
- Successful vectorAdd
- TensorFlow import
- Successful gpu_device_name
- Failed vectorAdd
- Failed gpu_device_name
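A sketch of the check described above, run once before suspend and once after resume; it assumes the CUDA samples have been built under /usr/local/cuda/samples and that TensorFlow is importable from `python`:

```bash
# Run the CUDA vectorAdd sample, then query TensorFlow for a GPU device name.
/usr/local/cuda/samples/bin/x86_64/linux/release/vectorAdd
python -c "import tensorflow as tf; print(tf.test.gpu_device_name())"
```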
I’m having the exact same bug using the docker version of tensorflow. Specs:
Killing the jupyter notebook (or any of the processes using the GPU) never worked for me.
However, I have not had this problem anymore since I started using the recently introduced NVIDIA systemd services: `nvidia-suspend`, `nvidia-resume` and `nvidia-hibernate`, mentioned e.g. here. To make it work I added a `nvidia-suspend.conf` to `/etc/modprobe.d/` (typical contents sketched below) and then enabled these three services. So far I have not had any problems again after using this setup for about a month with several suspends.
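The conf file contents and the enable commands did not survive the copy; based on NVIDIA's documentation for these power-management services, the setup would typically be something like the following (treat the exact option names as an assumption and check your driver's README):

```bash
# Assumed typical contents of /etc/modprobe.d/nvidia-suspend.conf:
echo 'options nvidia NVreg_PreserveVideoMemoryAllocations=1 NVreg_TemporaryFilePath=/var/tmp' \
  | sudo tee /etc/modprobe.d/nvidia-suspend.conf

# Enable the three services mentioned above:
sudo systemctl enable nvidia-suspend.service nvidia-resume.service nvidia-hibernate.service
```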
I have Fedora 34, an RTX 3080 (mobile), driver 465.31, CUDA 11.3.
If you hit

```
rmmod: ERROR: Module nvidia_drm is in use
```

when you are trying to run

```bash
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
sudo modprobe nvidia
sudo modprobe nvidia_modeset
sudo modprobe nvidia_drm
sudo modprobe nvidia_uvm
```

be careful with the following if you are unfamiliar with the Linux command line.
You may want to try the following:

- `systemctl isolate multi-user.target` (your graphical interface will be offline)
- `systemctl start graphical.target` (your low-resolution interface is back)

I'd prefer to verify this by cd'ing into the `samples` folder of your CUDA-XX installation and running `./bandwidthTest`; e.g., my folder is at /usr/local/cuda-11.2/samples/bin/x86_64/linux/release and I run `./bandwidthTest` there. If you can get this test to pass, your device should be available now!
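For example (the path is the one from the comment above; adjust the CUDA version to your own install):

```bash
cd /usr/local/cuda-11.2/samples/bin/x86_64/linux/release
./bandwidthTest   # should end with "Result = PASS" when the GPU is usable again
```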
For what it's worth, this is still a problem in 2021 for me on Fedora 33. I tried to follow this advice, but I could also not unload `nvidia_drm`, for the same reason. However, after loading the first module in the list again, things worked for me. So for me it was just the single command below. Not claiming this works for everyone, though.
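The command itself was lost in the copy; given that the first module in the list above is `nvidia_uvm`, it was presumably just:

```bash
sudo modprobe nvidia_uvm
```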
@mmakowski I had this message because I didn't stop my Jupyter notebook server. Make sure you stop everything that could hold the GPU before doing this. I have a hybrid graphics card, so the `module is in use` problem might also be due to the GPU being the only graphics card, as you say, but it's worth a try 😃

I still have not found a fix either.
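One way to check what is still holding the GPU before unloading the modules (assuming the usual `/dev/nvidia*` device nodes and that `fuser` from the psmisc package is installed):

```bash
# Show every process with an open handle on the NVIDIA device nodes.
sudo fuser -v /dev/nvidia*
```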