tensorflow: TF 2.16.1 Fails to work with GPUs
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
binary
TensorFlow version
TF 2.16.1
Custom code
No
OS platform and distribution
Linux Ubuntu 22.04.4 LTS
Mobile device
No response
Python version
3.10.12
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
12.4
GPU model and memory
No response
Current behavior?
I created a Python venv in which I installed TF 2.16.1 following your instructions: pip install tensorflow. When I run python, import tensorflow as tf, and issue tf.config.list_physical_devices('GPU'), I get an empty list [].
I created another python venv, installed TF 2.16.1, only this time with the instructions:
python3 -m pip install tensorflow[and-cuda]
When I run that version, import tensorflow as tf, and issue
tf.config.list_physical_devices('GPU')
I also get an empty list.
BTW, I have no problems running TF 2.15.1 with GPUs on my box. Julia also works just fine with GPUs, and so does PyTorch.
Standalone code to reproduce the issue
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-03-09 19:15:45.018171: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-09 19:15:50.412646: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> tf.__version__
'2.16.1'
>>> tf.config.list_physical_devices('GPU')
2024-03-09 19:16:28.923792: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-09 19:16:29.078379: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
>>>
Relevant log output
No response
About this issue
- Original URL
- State: open
- Created 4 months ago
- Reactions: 30
- Comments: 95 (2 by maintainers)
It's just that TensorFlow can't see the CUDA libraries.
Install tensorflow[and-cuda] and add this to your .bashrc or conda activation script. Adjust the Python version in it according to your setup.
NVIDIA_PACKAGE_DIR="$CONDA_PREFIX/lib/python3.12/site-packages/nvidia"
for dir in $NVIDIA_PACKAGE_DIR/*; do
  if [ -d "$dir/lib" ]; then
    export LD_LIBRARY_PATH="$dir/lib:$LD_LIBRARY_PATH"
  fi
done
You won’t need to install cuda or cudnn on the system. only the cuda libraries that are installed with $ pip install tensorflow[and-cuda] would be enough.
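A quick sanity check (a sketch; it only inspects the variable that the snippet above sets) that the pip-installed NVIDIA libraries actually made it onto the search path:

```shell
# List the LD_LIBRARY_PATH entries that point at pip-installed NVIDIA libs:
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep 'site-packages/nvidia' \
    || echo "no pip-installed NVIDIA libraries on LD_LIBRARY_PATH yet"

# Then confirm TensorFlow sees the GPU:
# python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```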
I am closing this (unresolved issue) because I am told by the Keras/TF team that the issue is related to TF.
I think the expected solution would be a new release that fixes this issue, so that setting LD_LIBRARY_PATH is not needed, as in 2.15.1. It would be a downgrade for users to have to do such workarounds; it should just work with: pip install tensorflow[and-cuda]
Well, I wasted 8 hours of my Sunday on this, setting up another PC from scratch, before reverting to the old version. Now looking to move off TensorFlow.
Hi @JuanVargas ,
For the GPU package you need to ensure the CUDA driver is installed, which can be verified with the nvidia-smi command. Then you need to install the TF CUDA package with
pip install tensorflow[and-cuda]
which automatically installs the required CUDA/cuDNN libraries. I have checked in Colab and was able to detect the GPU. Please refer to the attached gist.
@niko247 undoubtedly true. It is crystal clear that TF 2.16.1 does not work with the simple
pip install tensorflow[and-cuda]
command to actually utilize CUDA locally, and no guidelines have been provided yet to resolve this. It seems practically impossible for someone owning a PC with a CUDA-enabled GPU to perform deep learning experiments with TensorFlow 2.16.1 and utilize the GPU locally without manually performing extra steps not included (until today) in the official TensorFlow documentation's standard installation procedure for Linux users with GPUs, at least as a temporary fix! That's why I submitted the pull request in good faith and for the sake of all users, as TensorFlow is "An Open Source Machine Learning Framework for Everyone".
Hope that the next patch version of TensorFlow will fix the bug as soon as possible!
Following on from the post by chaudharyachint08, I did the following to automate it on venv.
I edited
bin/activate
in the folder of my venv and added the two lines at the end of the file. Then, while editing the same file, I added the two unset lines inside the deactivate function (before the closing } curly bracket).
I had tested it by entering the 2 lines in the terminal and my GPU was detected, so this was just the automation when the venv is activated.
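The exact lines were not quoted in the comment above; a hypothetical sketch of such an edit (assuming the venv uses Python 3.10 and the pip-installed NVIDIA libraries live under site-packages/nvidia) might look like:

```shell
# Appended at the end of <venv>/bin/activate (hypothetical sketch):
NVIDIA_DIR="$VIRTUAL_ENV/lib/python3.10/site-packages/nvidia"
for dir in "$NVIDIA_DIR"/*/lib; do
    if [ -d "$dir" ]; then
        export LD_LIBRARY_PATH="$dir:$LD_LIBRARY_PATH"
    fi
done

# And inside the deactivate() function, before the closing "}":
#     unset NVIDIA_DIR
#     unset LD_LIBRARY_PATH
```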
I have the same problem with Ubuntu 22.04.4 with the following environment:
tensorflow==2.16.1
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
nvcc --version
output:

Hello everyone, I had the same problem here and managed to solve it! Thanks to @njzjz, setting
TF_CPP_MAX_VLOG_LEVEL=3
shows more information. What I did was search for where cudnn is placed:
Adding /usr/lib/x86_64-linux-gnu to LD_LIBRARY_PATH solved my problem 😃 Hope this helps!
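The debugging flow described above can be sketched as follows (the search locations are illustrative; /usr/lib/x86_64-linux-gnu is the directory the comment above found):

```shell
# 1. Make TensorFlow log which shared libraries it tries to dlopen:
# TF_CPP_MAX_VLOG_LEVEL=3 python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# 2. Search for where libcudnn actually lives on the system:
find /usr/lib /usr/local -name 'libcudnn*' 2>/dev/null

# 3. Add the directory found in step 2 to the dynamic linker's search path:
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH"
```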
Not really a solution, but a clarification and summary of what is happening so far. As per the investigations of @njzjz, we see that libcublas et al. are loaded, whereas libcudnn is not. He also investigated inside a Docker container and found that none of the libraries are loaded there. He further updated the library path and found that the libraries were then loaded.

I have tried this on my end and came to the same conclusion. I think what's happening is that I have installed libcublas, libcudart, etc. from the NVIDIA Fedora repo (which does not include libcudnn), which is why those libraries load but not libcudnn after I run pip install tensorflow[and-cuda], even though all of these libs exist in my venv lib dir. It seems that the libraries in /usr/local/cuda/lib64 are searched, but not the ones in the venv lib dir.

@njzjz's and @sgkouzias's solutions further support this. Clearly the current workaround is to follow their path-altering advice. I hope this is useful to somebody, and that there is a fix soon.
Hi Soheil,
I tried your suggestion of adding LD_LIBRARY_PATH and then installing TF 2.16.1. I am happy to report that your suggestion appears to work. I say "appears" only because I tested whether TF can detect the GPUs via the command
which returns a non-empty list, so I assume this will work with commands too.
I hope the TF team fixes the issue soon.
Thank you !
Juan E. Vargas
I am of the opinion that just doing RC0 and then final release is not good testing. I hope the 2.16 situation was just a one off situation, to save time (it took the same amount of time for 2.16 with just one RC and final as for older releases with 3 RCs, multiple vulnerability fixes, etc.). I am no longer in the TensorFlow team, just helping here and there on the GitHub issues and PRs.
Thanks for the workaround, but the path is wrong (conda is missing). Also the file name is arbitrary:
${CONDA_PREFIX}/etc/conda/activate.d/nvidia-lib-dirs.sh
. Also, the first line can be simplified, and if you don't have spaces in your environment path, the rest can be simplified as well:
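The simplified versions were not included in the comment; one possible reading (a sketch, assuming no spaces in the environment path and a single python3.* directory under the environment) is:

```shell
# ${CONDA_PREFIX}/etc/conda/activate.d/nvidia-lib-dirs.sh (file name is arbitrary)

# First line, without hardcoding the Python version:
NVIDIA_PACKAGE_DIR=$(echo "$CONDA_PREFIX"/lib/python3.*/site-packages/nvidia)

# The loop collapses to a single line:
for dir in $NVIDIA_PACKAGE_DIR/*/lib; do if [ -d "$dir" ]; then export LD_LIBRARY_PATH="$dir:$LD_LIBRARY_PATH"; fi; done
```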
Got it to work 😃 First, go to https://developer.nvidia.com/rdp/cudnn-archive?source=post_page-----bfbeb77e7c89--------------------------------
then download the Local Installer for Ubuntu22.04 x86_64 (Deb),
unpack, and install libcudnn8_8.9.7.29-1+cuda12.2_amd64.deb
@Sivabooshan congrats! However note that:
- Your CUDA version is 12.4, which is not compatible with TensorFlow version 2.16.1; that's why he might need to install TensorFlow in a virtual environment, so as to avoid downgrading CUDA and potential systemic global pollution (if you install packages globally, they clutter your main Python installation and can potentially interfere with system processes; virtual environments protect your system-wide environment from this).
- With pip install tensorflow[and-cuda], all required NVIDIA libraries are installed as well. You just need to configure the environment variables manually as appropriate in order to utilize them and run TensorFlow 2.16.1 with GPU locally.

That's why I submitted the pull request in good faith and for the sake of all users, as TensorFlow is "An Open Source Machine Learning Framework for Everyone". Hope that the next patch version of TensorFlow will fix the bug as soon as possible!
There should also be instructions for venv users.
Yes, please!
In Colab, GPU VMs, and the Docker image, you have CUDA installed as a system lib, so TensorFlow looks into /usr/lib and finds it. But the standard thing that worked before 2.16, and the expected thing from pip install tensorflow[and-cuda], is that TensorFlow should also look into the CUDA libraries that were installed via pip. If it looks, it can find nvcc, cufft, cublas, etc. in there. The problem is that it just doesn't consider them; PyTorch and TensorFlow 2.15 do. You sure can install CUDA as a system lib and 2.16 works, but that would be unnecessary, and impossible for users without admin rights. The env-var fix posted above just adds those pip-installed CUDA libraries to the library path so TensorFlow finds and uses them.
The fix is simple: you just need to modify the logic that searches for CUDA libraries. But it requires modifying C++ files and recompiling TensorFlow to be tested, which exceeds the resources of most users.
@SuryanarayanaY Steps to reproduce:
gives this message:
I use an RTX 3080 Ti, but I think it's an issue for all GPUs. nvidia-smi:
But installing tensorflow==2.15.1 with python 3.10 works without additional steps
It does help! I had the same problem (Ubuntu 22.04, TensorFlow 2.16.1): "Cannot dlopen some GPU libraries...", even though the NVIDIA drivers were properly installed. TensorFlow could not find the libcudnn.so.8 shared library. Thanks to SoheilKhatibi for providing the solution that worked for me. In my case, I found the shared library in my Python virtual environment (named "venv1"), under /venv1/lib/python3.10/site-packages/nvidia/cudnn/lib/. So adding a line in .bashrc ("export LD_LIBRARY_PATH=" pointing to the proper path) solved the problem. If you do this, don't forget to reload .bashrc using: source ~/.bashrc
My next problem was that, even though my TensorFlow Python code worked fine when run from the terminal, the same script executed within PyCharm still threw the "Cannot dlopen some GPU libraries" error. To solve this, you need to go to PyCharm's Run menu and select Edit configurations. In the left panel, select the script for which you want to solve the GPU error, and open the Environment variables (right panel). Add the user environment variable LD_LIBRARY_PATH and give it the value corresponding to the path you've put in .bashrc. Save and run, and the error is gone in PyCharm too!
Wanted to drop in and thank everyone for the sleuthing done up to this point. I had no idea about the environmental variable up there that allows you to debug the loading of the libraries so easily. That was neat.
Anyway, to amalgamate all of the suggestions in the thread, you can build an
env_vars.sh
script withinenvs/<your environment>/etc/conda/activate.d/
in an anaconda-like install that looks like the following:You do indeed need the two
dirname
commands for TF 2.16. Similarly, you need to put the cuda_nvcc bin directory in the path so ptxas can be found. Doing it this way also generalizes to your specific conda environment. I’ve confirmed that you can perform the intro TF tutorial (https://www.tensorflow.org/tutorials/quickstart/beginner) with this workaround.Having said all that, the TF Linux pip install page badly needs an update. The CUDA Toolkit and CUDNN version numbers are out of date. Since
tensorflow[and-cuda]
is now the recommended way to install everything for the GPU, the recommendation to install the other libraries independently is now useless as far as I can tell.Success!! Ultimately, I located valid ptxas in
...lib/python3.10/site-packages/nvidia/cuda_nvcc/bin
and manually added the specified path to my environment variables only within the conda virtual environment I created for TensorFlow version 2.16.1 It works like a charm!It does not work unfortunately. It worked with TF<=2.15.1. You would need to test it on non-docker version, since that one actully installs cuda libraries in the container with apt, not pip. Colab is also not a good place to test, because if you do pip list, you can see that cuda libraries are not installed from pypi, but come from some other methods. For this, please test local linux or wsl installation that has only nvidia drivers, not anything else. run pip install tensorflow[and-cuda]==2.16.1in a conda / venv / … environment, and try to list the gpus. It will not work. Then test with 2.15.1 in another environment and see that it works.
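The env_vars.sh script itself was not reproduced in the thread; based on its description above (two dirname calls to locate the pip-installed nvidia directory, plus cuda_nvcc/bin on the PATH so ptxas can be found), a sketch might look like:

```shell
# envs/<your environment>/etc/conda/activate.d/env_vars.sh (hypothetical sketch)

# Two dirname calls: from .../nvidia/cudnn/__init__.py up to .../site-packages/nvidia
NVIDIA_DIR=$(dirname "$(dirname "$(python -c 'import nvidia.cudnn; print(nvidia.cudnn.__file__)' 2>/dev/null)")")

# Put every bundled lib/ directory on the dynamic linker's search path:
for dir in "$NVIDIA_DIR"/*; do
    if [ -d "$dir/lib" ]; then
        export LD_LIBRARY_PATH="$dir/lib:$LD_LIBRARY_PATH"
    fi
done

# And make ptxas findable:
export PATH="$NVIDIA_DIR/cuda_nvcc/bin:$PATH"
```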
@SuryanarayanaY @mihaimaruseac The least the TensorFlow team can do is to test and acknowledge the problem, or say it is not planned, or anything. Just ignoring the problem without even testing it is not a supportive act at all.
I am having the same issue.
env: Ubuntu 22.04 + Python 3.10.13 + CUDA 12.4 + tensorflow 2.16.1
The thing is, JAX builds against multiple CUDA versions, whereas TF always pinned to just one version
I have given up on TensorRT. I guess I won’t be using it either.
Agreed. Installing TF has always been hit or miss and it seems that in the many years since I last used TF that hasn’t changed one bit.
I don’t actually use TensorRT, but I would check if the required .so file for it is visible to tensorflow. Maybe I would need to find the name of required file in tensorflow source code.
This actually doesn’t change the fact that the new tensorflow version should be tested by google team before release, or the bugs should be fixed. It seems they only care about having a working docker image, not anything else.
Thanks @sh-shahrokhi. I thought it was path-related. Modified slightly to make it Python-version independent, if you put it in your conda environment activation ([environment]/etc/activate.d/env_vars.sh).
This is not a resolution, as this post-install step should not be necessary.
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I can’t seem to do similar tricks to resolve the TensorRT issues when installed similarly into the conda environment. Any ideas?
Hi Krzysztof
I visited the site https://developer.nvidia.com/rdp/cudnn-archive?source=post_page-----bfbeb77e7c89--------------------------------
where I found an entry listed as "Local Installer for Ubuntu22.04 x86_64 (Deb)", which I downloaded. Unfortunately, what I got is a package named "cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb", which is not the same as the name you suggest in your message, "libcudnn8_8.9.7.29-1+cuda12.2_amd64.deb".
I assume what you meant is to get the libcudnn8_8.9.7.29*amd64.deb and the cuda12.2_amd64.deb separately and install both.
I have CUDA 12.4. I will not go back to trying to make TF 2.16.1 work with older versions of CUDA (12.2 or 12.3), because sooner or later the TF team will have to produce a version with the updated version of CUDA. IMHO, rather than us wasting time going back in versions, the TF team should invest time going forward to update TF to the current CUDA version.
Thank you, Juan
Hello! I outlined this behavior in a duplicate ticket ( #65842 ). Torch also now installs its CUDA dependencies using the NVIDIA-managed pip packages. However, Torch doesn’t appear to require the LD_LIBRARY_PATH to be set for the linker, like TF still does. I assume this is because they’re manually sourcing libs from the venv. Is this functionality on the roadmap for TF?
Thank you!
Thanks! Downgrading to 2.15.1 works well for me too (with TensorFlow Probability <= 0.23).
Updated the respective pull request (pending review) yesterday. The fix was successfully tested today by @weavermech as well.
Added instructions needed to resolve the ptxas issue.
I thank everyone here for leading me to a simple 'fix' for this issue: copy/paste. Sorry, but you guys are leaps and bounds above my understanding of Linux, TensorFlow, and WSL2; I just wanted to tinker with a little ML using my RTX 3090. Since Windows apparently isn't supported, here I am wondering what I've gotten myself into. If you are like me and can follow the most basic of instructions, here's what I did after reading this thread. I used this command: TF_CPP_MAX_VLOG_LEVEL=3 python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" and the output told me which libraries weren't found. More importantly, it told me the one it did find. Because I use WSL2, Windows Explorer was all I needed to find the "unfindable" library files, which were all there in their own nice little folders, including the one that was found. I simply copy/pasted all the files of the ones that weren't found into the same location as the one that was found. Worked for me, but one thing's become abundantly clear: mileage may vary. Good luck. I can't believe I spent two days trying to figure out all this command-line stuff and bash, zsh, echo this and that. How is this even a thing?
Agree. We used to have the instructions for setting the cuda path to the environment variable
LD_LIBRARY_PATH
in earlier versions. I think we either need to add these to the documentation, or at least add a note in the pip install guide that the user has to set up the path for their own environment. The same instructions may not work for all environments, which may be why they were discarded. In any case, adding a note about this in the installation guide is a must and can avoid confusion.
If anyone here is willing to contribute the required notes, please feel free to help. The changes can be proposed at this doc source, and will be reflected in this page.
Wanted to stop by with an update. Tried a fresh install this morning via miniforge (anaconda) with Python 3.11.8. Followed the instructions on the TF website with the simple pip install tensorflow[and-cuda]. Same thing: it would not recognize the GPU. Adding in the env_vars.sh fix I noted above makes the GPU recognizable. Output from nvidia-smi:

So, @SuryanarayanaY, even with NVIDIA GPU driver version 545, TF 2.16 still could not find GPU devices. The bug should be reported and fixed.
Thanks for all hints. I have written all steps in my blog.
https://mobinshaterian.medium.com/use-gpu-in-tensorflow-on-ubuntu-22-04-f033e59cf5cb
Every year it becomes more complicated and difficult to set up TensorFlow with NVIDIA GPU support on Windows. The following worked for me, which is a good example of a very bad developer experience.
Install CUDA-Toolkit 12.4 on host machine
Setup Ubuntu 22.04.3 LTS from Windows Store and update it.
Yes try downgrading TensorFlow with
pip install tensorflow==2.15.0.post1
. At least this works fine on my Ubuntu 20.04.

Thanks for the update.
Seriously… I followed the instructions exactly. Yes, I used tensorflow[and-cuda], under WSL and conda. I'm using VS Code, and it makes the conda environment in ./.conda. Sticking with the previous pip release, 2.15.1, then porting to plain PyTorch and Transformers.
I believe that
tensorflow[and-cuda]
does not work as expected. For reproduction, I built a simple Docker image. Build and run it:

The output shows that the libraries cannot be found, the same as above.
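The Dockerfile itself was not included above; a hypothetical minimal image for this reproduction (the base image and Python version are my assumptions) might be:

```shell
# Work in a scratch directory and write a minimal Dockerfile (hypothetical sketch):
cd "$(mktemp -d)"
cat > Dockerfile <<'EOF'
FROM python:3.11-slim
RUN pip install 'tensorflow[and-cuda]==2.16.1'
EOF

# Build and run it (requires Docker and the NVIDIA container toolkit):
# docker build -t test-tf216-cuda .
# docker run --gpus all test-tf216-cuda \
#     python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```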
Executing
docker run --gpus all -e TF_CPP_MAX_VLOG_LEVEL=3 -e LD_DEBUG=libs test-tf216-cuda python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
will give more information; I think the search path is entirely wrong. run.log

Actually, you should have version 12.3.107, since it is pinned in the [and-cuda] extra. Or did you install it differently? Can you check which version you have with pip show nvidia-cuda-nvcc-cu12 and with conda list nvidia-nvcc? You should be able to update.

Same issue for me. Is there any official solution for it? I started using WSL2 just today; I don't want to find and add the PATH manually. I am giving up on TF.
Thanks @mihaimaruseac! I had it pinned, also tried with the SHA digest but then realized I had unpinned tensorflow-text in my requirements and that was upgrading TF as well 😛
Again, you are on the tensorflow repo. You closed the wrong issue.
You were asked to close the issue on keras, because it is not a keras issue. This is tensorflow, where proper detection of the cuda libs is definitely still an issue. Please reopen.
I closed the issue because I was asked to do so by a TF/Keras team member, for the reasons I stated in the closing comment. Just like you (Leigh), I am disappointed, and I decided not to waste more time. There are strong reasons for some of us (users) to prefer using our own hardware. I hope the TF/Keras people get that message. Thanks, Juan
In general, we used to test RC versions before release. For example, we used to have RC0, RC1 and RC2 for TF 2.9. This gave people and downstream teams enough time to test and report issues.
It seems that 2.16.1 only had an RC0 (for 2.16.0).
The release process is (was?) like this: a release branch is cut (e.g. r2.17), then one or more RCs are published until no blocking bugs remain, and then the final release is made.
Overall, this process would take number_of_RCs + 1 weeks, with a possibility of a few more weeks of delay. However, for the 2.16 release, although the branch was cut on Feb 8th, there has been only one RC. Most likely the issues can be solved by a patch release.
I’m not sure if this is the root cause, but I resolved my own issue which also surfaced as a “Cannot dlopen some GPU libraries.” error when trying to run
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
To resolve my issue, I followed the tested build versions here: https://www.tensorflow.org/install/source#gpu
and I needed to update my existing installations from cuDNN 9 -> 8.9 and CUDA 12.4 -> 12.3.
When you're on an NVIDIA download page like this one for the CUDA Toolkit, don't just download the latest version. See previous versions by hitting "Archive of Previous CUDA Releases".
@JuanVargas can you try uninstalling your existing CUDA installation and moving to a tested build configuration for TF 2.16 by downgrading to CUDA 12.3?
I followed this post to uninstall my existing cuda installation: https://askubuntu.com/questions/530043/removing-nvidia-cuda-toolkit-and-installing-new-one
@DiegoMont can you try upgrading your cuDNN to 8.9 and CUDA to 12.3?
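As a small sketch of that check, a shell helper (the function name is mine; the 12.3 target comes from the tested-build table linked above) could flag an untested CUDA toolkit:

```shell
# Compare the local `nvcc --version` release line against the CUDA release
# TF 2.16 was tested with (12.3, per https://www.tensorflow.org/install/source#gpu).
check_cuda_release() {
    case "$1" in
        *"release 12.3"*) echo "OK: CUDA 12.3 matches the tested configuration" ;;
        *) echo "WARNING: untested CUDA release for TF 2.16; consider downgrading to 12.3" ;;
    esac
}

# Typical use (assumes nvcc is on PATH):
# check_cuda_release "$(nvcc --version | grep release)"

# Prints the WARNING branch for an untested 12.4 toolkit:
check_cuda_release "Cuda compilation tools, release 12.4, V12.4.131"
```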