tensorflow: cuDNN, cuFFT, and cuBLAS Errors
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
source
TensorFlow version
GIT_VERSION:v2.14.0-rc1-21-g4dacf3f368e VERSION:2.14.0
Custom code
No
OS platform and distribution
WSL2 Linux Ubuntu 22
Mobile device
No response
Python version
3.10, but I can try different versions
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
CUDA version: 11.8, cuDNN version: 8.7
GPU model and memory
NVIDIA Geforce GTX 1660 Ti, 8GB Memory
Current behavior?
When I run the GPU test from the TensorFlow install instructions, I get several errors and warnings. I don’t care about the NUMA messages, but the first three errors say that TensorFlow was unable to register the cuDNN, cuFFT, and cuBLAS factories. I would really like to use cuDNN to speed up training some RNNs and FFNNs. My GPU does appear in the list of physical devices, so I can still train, just not as fast as with cuDNN.
Standalone code to reproduce the issue
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Relevant log output
2023-10-09 13:36:23.355516: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-09 13:36:23.355674: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-09 13:36:23.355933: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-09 13:36:23.413225: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-09 13:36:25.872586: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-09 13:36:25.916952: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-09 13:36:25.917025: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
About this issue
- Original URL
- State: open
- Created 9 months ago
- Reactions: 59
- Comments: 124 (4 by maintainers)
Commits related to this issue
- tensorflow unreferenced thus removed this causes the errors noted in https://github.com/tensorflow/tensorflow/issues/62075 but armory doesn't in fact use them — committed to twosixlabs/armory-library by mwartell 5 months ago
- feat: Add pillow and create different environment files for windows and wsl - "environment-win" holds is a working env with Tensorflow 2.15 but only with CPU support (as GPU on bare Windows is not su... — committed to Korred/unet-pp by Korred 4 months ago
- feat: Add pillow and create different environment files for windows and wsl - "environment-win" holds a working env with Tensorflow 2.15 but only with CPU support (as GPU on bare Windows is not suppo... — committed to Korred/unet-pp by Korred 4 months ago
Hello,
I’m experiencing the same issue, even though I meticulously followed all the instructions for setting up CUDA 11.8 and CuDNN 8.7. The error messages I’m encountering are as follows:
Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered. Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered. Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered.
I’ve tried this with different versions of Python. Surprisingly, when I used Python 3.11, TensorFlow 2.13 was installed without these errors. However, when I used Python 3.10 or 3.9, I ended up with TensorFlow 2.14 and the aforementioned errors.
I’ve come across information suggesting that I may not need to manually install CUDA and CuDNN, as [and-cuda] should handle the installation of these components automatically.
Could someone please guide me on the correct approach to resolve this issue? I’ve tried various methods, but unfortunately, none of them have yielded a working solution.
P.S. I’m using conda in WSL 2 on Windows 11.
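One rough way to check whether the [and-cuda] extra actually pulled in NVIDIA’s runtime wheels (a sketch; the exact package set varies by TensorFlow version):
pip list | grep -iE 'nvidia|tensorflow'
# Expect entries like nvidia-cudnn-cu11 and nvidia-cublas-cu11 next to tensorflow;
# if no nvidia-* packages are listed, CUDA must come from the system install instead.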
I’m dying; this issue is killing my career.
I also have the same issue, and it doesn’t seem to be due to the CUDA environment, as I rebuilt CUDA and cuDNN to match tf-2.14.0.
This is the log output I get:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2023-10-11 18:21:57.387396: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-10-11 18:21:57.415774: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-11 18:21:57.415847: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-11 18:21:57.415877: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-11 18:21:57.421400: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-11 18:21:58.155058: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-11 18:21:59.113217: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-11 18:21:59.152044: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-11 18:21:59.152153: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
@picobyte I am already using a venv, and system packages are not related to this bug 😉. TensorFlow is linking the same object twice; it’s not my system packages.
It turns out that bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/compiler/xla/stream_executor/cuda/libcudnn_plugin.pic.lo contains the duplicated symbol(s), and
bazel aquery 'mnemonic("CppLink", //tensorflow:libtensorflow_cc.so.2.14.0)' --output=text
shows that it is linked into libtensorflow_cc. After a whole-day-long investigation, I found the relevant comment in tensorflow/BUILD. So, apparently, ddunleavy already knew about this bug.
@SuryanarayanaY I tried several times, reinstalling Ubuntu, but it still doesn’t work.
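For anyone who wants to reproduce the duplicate-symbol check from the investigation above, here is a hedged sketch (the registrar symbol names differ between TF versions, so the grep pattern is only a guess):
nm -DC --defined-only libtensorflow_framework.so.2.14.0 | grep -i cudnn
nm -DC --defined-only libtensorflow_cc.so.2.14.0 | grep -i cudnn
# If the same registration symbol is defined in both shared libraries, its static
# initializer runs once per loaded library and the second attempt is rejected.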
Spent 5 hours on this, the register factory issue was resolved with:
$ pip uninstall tensorflow
$ python3 -m pip install tf-nightly[and-cuda]
But first verify that you’ve set up cuDNN and nvcc correctly; otherwise it’s not TF’s problem.
Now I have the NUMA issue remaining, gonna go figure that out next.
@ymodak This piece of shit is still unresolved in the latest version. WHY and HOW could you remove the bug label!? That DOES NOT MAKE SENSE! HOW INSANE! I’m confused as to whether your test cases even cover this issue.
I have found a workaround that seems to be effective. By installing the following specific versions of TensorFlow and its related packages, the issue is resolved:
These versions work well together and avoid the problem encountered with TensorFlow 2.14.0.
According to @Romeo-CC in https://github.com/tensorflow/tensorflow/issues/62002#issuecomment-1800718221: The issue was present in versions 2.10 and 2.11, resolved in 2.12, but reemerged in 2.14.
Hi TensorFlow maintainers, can this please be fixed? It works with GPU for 2.13 but no longer for 2.15 (I haven’t tried 2.14, but others above seem to have):
Tensorflow==2.13.1 works without any issue. (Environment: Ubuntu 22.04.2 LTS WSL2, CUDA 11.8, cuDNN 8.9.6)
python3.11 -m pip install tensorflow==2.13.1
I installed 2.15 with no luck, then tried 2.14.1 with errors still there.
CUDA
https://developer.nvidia.com/cuda-11-8-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network
Installation Instructions:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-11-8
Set up your paths:
echo 'export PATH=/usr/local/cuda-11.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
sudo ldconfig
cuDNN
https://developer.nvidia.com/rdp/cudnn-download
Download cuDNN v8.9.6 (November 1st, 2023), for CUDA 11.x
wget https://developer.nvidia.com/downloads/compute/cudnn/secure/8.9.6/local_installers/11.x/cudnn-linux-x86_64-8.9.6.50_cuda11-archive.tar.xz
sudo tar -xvf cudnn-linux-x86_64-8.9.6.50_cuda11-archive.tar.xz
sudo mv cudnn-linux-x86_64-8.9.6.50_cuda11-archive cuda
copy the following files into the cuda toolkit directory:
sudo cp -P cuda/include/cudnn*.h /usr/local/cuda-11.8/include
sudo cp -P cuda/lib/libcudnn* /usr/local/cuda-11.8/lib64/
sudo chmod a+r /usr/local/cuda-11.8/lib64/libcudnn*
Verify Installation:
nvidia-smi
nvcc -V
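As an extra sanity check (not part of the original steps), TensorFlow itself can report whether the wheel was built with CUDA and whether it sees the GPU:
python3 -c "import tensorflow as tf; print(tf.test.is_built_with_cuda(), tf.config.list_physical_devices('GPU'))"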
For Numba
Add these lines at the end of your .bashrc:
export LD_LIBRARY_PATH="/usr/lib/wsl/lib/"
export NUMBA_CUDA_DRIVER="/usr/lib/wsl/lib/libcuda.so.1"
Hi @Ke293-x2Ek-Qe-7-aE-B ,
Starting from TF 2.14, TensorFlow provides a CUDA extra that can install all the cuDNN, cuFFT, and cuBLAS libraries. You can use
pip install tensorflow[and-cuda]
for that. Please try this command and let us know if it helps. Thank you!
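A minimal end-to-end sketch of that approach in a fresh virtual environment (the environment name here is arbitrary, and the quotes around the extra matter in some shells, as noted later in this thread):
python3 -m venv tf-gpu-env && source tf-gpu-env/bin/activate
pip install --upgrade pip
pip install "tensorflow[and-cuda]"   # quotes keep the shell from interpreting the brackets
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"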
It’s been five months, yet the problem remains.
Installing 2.16.0-dev20231212 with tf-nightly[and-cuda] resolved the issue for me.
P.S. The error is also still present in 2.15.
@ymodak How is it not a bug when you add the same global variable to two different shared libraries that have to be used together in most cases, causing the same initialization code to be run twice? This was never the intention, and it is inherently wrong. The fact that there seems to be no ill effect, because the second call is ignored, does not convince me it should be left like this: there might be other duplicates. We’re clearly talking about the “initialization order fiasco” here, or something of the same level of badness.
I have a first try at solving this that should hopefully land soon; waiting on internal review. Sorry for the delay here.
tf 2.15, followed install guide on wsl2, same situation
Can confirm, with tf 2.15, and a complete system update, these errors persist, but do not appear to limit any functionality or performance.
The message suggests the factories in question are already present:
Attempting to register factory for plugin cuDNN when one has already been registered
I don’t understand why the bug label was removed, even though it’s still a bug and an error. Not to mention there’s a new version, TensorFlow 2.15, where it should have been fixed, yet it remains unresolved.
When 2.15.X didn’t work, TensorFlow 2.16.1 (without CUDA) solved this issue for me. Python 3.10, CUDA driver 12.2, CUDA Toolkit 12.1, cuDNN 8.9.5.
pip uninstall tensorflow && pip install tf-nightly[and-cuda]
@Romeo-CC @CarloWood Sorry for the misunderstanding. I see that this issue is related to type:build/install, and as per the issue triage workflow we limit type label usage to one label per issue for internal tracking purposes, hence I removed the type:bug label. Having said that, I acknowledge that this is an active problem and have notified the relevant team.
Same issue
Just a quick heads up @maulberto3: when you use nvidia-smi, the CUDA version you see in the top right corner is the highest version of CUDA compatible with your NVIDIA driver, not the currently installed CUDA version. Good to know, since that’s confused me in the past.
I did delve deeper into this and found the following: TensorFlow 2.14.0 is initializing the above three libraries two times. The first time from here:
and
The second time from here:
and
Got similar error on Ubuntu 22.04.3 LTS 😦
Can confirm it works with 2.16.1. For those who have to resort to the 2.9.0 workaround (some of my packages are limited to 2.15), use Python <= 3.10 to install it.
It finally worked with TensorFlow 2.16.1 (upgraded to latest):
pip install --upgrade tensorflow
This is not working either.
Same error on Ubuntu 22.04 LTS installed in WSL2 / Windows 11. Has anyone found a solution to this?
Works perfectly. You can check which CUDA/cuDNN versions your TensorFlow build works with using this code:
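The snippet itself didn’t survive here, but a check along these lines would do it (a sketch assuming tf.sysconfig.get_build_info(), which reports the CUDA and cuDNN versions a GPU wheel was built against; the keys are absent on CPU-only wheels, hence .get):
python3 -c "import tensorflow as tf; info = tf.sysconfig.get_build_info(); print(tf.__version__, info.get('cuda_version'), info.get('cudnn_version'))"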
I am having the same issue as FaisalAlj above, on Windows 10 with the same versions of CUDA and cuDNN. The package tensorflow[and-cuda] is not found by pip. I’ve tried different versions of Python and TensorFlow without success. In my case I’m using virtualenv rather than conda.
Edit 1: I appear to be able to install tensorflow[and-cuda] as long as I use quotes around the package, like: pip install "tensorflow[and-cuda]".
Edit 2: I still appear to be getting these messages, however, so I’m not sure I’ve installed things correctly.
Yeah. But I just found that when I downgrade to 2.13.0, the registration errors no longer appear. It looks like this:
(TF) ephys3@ZhouLab-Ephy3:~$ python3 -c "import tensorrt as trt;import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Although I haven’t figured out how to solve the NUMA node error, I found some clues in another issue (I ran all of the above in WSL Ubuntu). According to an explanation on the NVIDIA forums, this bug does not seem significant. So I guess the registration errors have something to do with the latest version, while the NUMA errors may be caused by the OS environment. Hope this information helps.
@AthiemoneZero Because it still outputs a GPU device at the bottom of the log, I am training on the GPU, just without cuDNN. It will be slower, but it is better than nothing, or than training on the CPU.
You are right, what a shame, I gave up and went to Rust.
The NUMA "non zero" problem can be solved this way:
1. Find your GPU's PCI address:
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 12GB] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU106 High Definition Audio Controller (rev a1)
The first line shows the address of the VGA-compatible device, the NVIDIA GeForce, as 01:00.0. Each one will be different, so change this part carefully.
2. Check and change the NUMA setting values. If you run ls on /sys/bus/pci/devices/, you can see the following list:
ls /sys/bus/pci/devices/
0000:00:00.0  0000:00:06.0  0000:00:15.0  0000:00:1c.0  0000:00:1f.3  0000:00:1f.6  0000:02:00.0
0000:00:01.0  0000:00:14.0  0000:00:16.0  0000:00:1d.0  0000:00:1f.4  0000:01:00.0
0000:00:02.0  0000:00:14.2  0000:00:17.0  0000:00:1f.0  0000:00:1f.5  0000:01:00.1
The 01:00.0 checked above is visible, with 0000: prefixed.
3. Check whether a NUMA node is connected:
cat /sys/bus/pci/devices/0000:01:00.0/numa_node
-1
-1 means no connection; 0 means connected.
4. Fix it with the command below:
echo 0 | sudo tee /sys/bus/pci/devices/0000:01:00.0/numa_node
0
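Note that this sysfs value resets on every reboot. One way to reapply the workaround at boot, assuming a distro that still executes /etc/rc.local and that 0000:01:00.0 is your GPU's PCI address (a sketch, adjust as needed):
#!/bin/sh -e
# /etc/rc.local - reapply the NUMA node workaround at boot (adjust the PCI address)
echo 0 > /sys/bus/pci/devices/0000:01:00.0/numa_node
exit 0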
Thank you for your comment @qnlzgl . I have attempted to fix the issue in various ways, but none have proven successful for me.
works for me, thank you very much
This has fixed the issue for me. For people dealing with errors using JAX who encounter "cuSolver ran into an error": this fixes it.
Tried with tf-nightly-2.16.0.dev20231219. Still the same issues. Python 3.10.12 | WSL2 Ubuntu 18.04.5 LTS | Windows 11
@ddunl Since I’m not using master, I had to change the diff a little. I applied the attached patch to 2.14.0. After that everything still compiled and the duplicated registration of the plugins vanished from my ‘hello-world’ test program.
HOWEVER, verification with objdump shows that now neither libtensorflow_framework.so.2.14.0 nor libtensorflow_cc.so.2.14.0 contains the global registration variable anymore!
Is there a way to test if things still really work? Because I have the feeling that now the plugins aren’t registered at all.
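One way to test that (a suggestion, not from the thread): cuDNN 8’s logging variables make it visible when the library is actually called, so run a convolution with logging enabled and look for cuDNN trace lines:
CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout python3 -c "
import tensorflow as tf
# A Conv2D on the GPU goes through cuDNN when the plugin is registered;
# with logging enabled, cuDNN prints API call traces to stdout.
x = tf.random.normal([1, 32, 32, 3])
print(tf.keras.layers.Conv2D(8, 3)(x).shape)"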
Great, when trying to attach cuda_plugins.patch I get the message: “We don’t support that file type. Try again with GIF, JPEG, JPG, MOV, MP4, PNG, SVG, WEBM, CPUPROFILE, CSV, DMP, DOCX, FODG, FODP, FODS, FODT, GZ, JSON, JSONC, LOG, MD, ODF, ODG, ODP, ODS, ODT, PATCH, PDF, PPTX, TGZ, TXT, XLS, XLSX or ZIP.”
I’ll paste it here, losing all the required white-space no doubt, but perhaps enough for inspection:
Same issue here on both WSL2 and Ubuntu.
Hi @Ke293-x2Ek-Qe-7-aE-B ,
I have checked the installation on Colab (Linux environment) and observed the same logs, as per the attached gist.
These logs seem to be generated by the XLA compiler, but the GPU is detectable. This is similar to issue #62002 and has already been brought to the Engineering team’s attention.
CC: @learning-to-play
@Ke293-x2Ek-Qe-7-aE-B You’re welcome. BTW, I also followed the instructions to configure a development environment, including suitable versions of bazel and clang-16, before all of my digging into the conda env.
@Ke293-x2Ek-Qe-7-aE-B Apologies for my misunderstanding. I installed the CUDA toolkit the same way you described above before I went directly to debugging tf_gpu. I made sure my GPU and CUDA performed well, as I had run another CUDA task smoothly without TF. What concerns me is that some dependencies of TF have to be pre-installed in a conda env, and this might be handled by [and-cuda] (my naive guess