tensorflow: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice.
Issue Type
Bug
Source
binary
Tensorflow Version
2.11.0
Custom Code
Yes
OS Platform and Distribution
Linux Ubuntu 22.04.01
Mobile device
No response
Python version
3.9.15
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
CUDA 11.2, cuDNN 8.1.0
GPU model and memory
Nvidia GTX 1060 6gb
Current Behaviour?
I was installing tensorflow according to this guide https://www.tensorflow.org/install/pip and ran into the error. I am running a fresh install of ubuntu. I have tried `export XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda` to no avail.
Standalone code to reproduce the issue
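import tensorflow as tf

# Assumption: x_train / y_train are the MNIST data used in the TensorFlow
# quickstart, scaled to [0, 1]; the original snippet did not show this step.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)  # fails here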
Relevant log output
Epoch 1/5
2022-11-24 23:30:47.064919: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7f4ce3255ca0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-11-24 23:30:47.064946: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1
2022-11-24 23:30:47.068586: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2022-11-24 23:30:47.086148: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2022-11-24 23:30:47.087159: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2022-11-24 23:30:47.087339: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2022-11-24 23:30:47.087434: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2022-11-24 23:30:47.106008: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2022-11-24 23:30:47.106292: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2022-11-24 23:30:47.125456: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2022-11-24 23:30:47.125753: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2022-11-24 23:30:47.144359: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2022-11-24 23:30:47.144670: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
InternalError Traceback (most recent call last)
Cell In [4], line 1
----> 1 model.fit(x_train, y_train, epochs=5)
File ~/miniconda3/envs/tf/lib/python3.9/site-packages/keras/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
67 filtered_tb = _process_traceback_frames(e.__traceback__)
68 # To get the full stack trace, call:
69 # `tf.debugging.disable_traceback_filtering()`
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb
File ~/miniconda3/envs/tf/lib/python3.9/site-packages/tensorflow/python/eager/execute.py:52, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
50 try:
51 ctx.ensure_initialized()
---> 52 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
53 inputs, attrs, num_outputs)
54 except core._NotOkStatusException as e:
55 if name is not None:
InternalError: Graph execution error:
Detected at node 'StatefulPartitionedCall_2' defined at (most recent call last):
File "/home/nathan/miniconda3/envs/tf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
...
File "/home/nathan/miniconda3/envs/tf/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_2'
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_2}}]] [Op:__inference_train_function_766]
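The warning in the log lists the directories that were searched and the expected layout `${CUDA_DIR}/nvvm/libdevice`. As a rough sketch (not part of the original report), the following checks which candidate directories contain `libdevice.10.bc` in that layout. The helper names are hypothetical, and the candidate list is simply copied from the log, so adjust it for your machine.

```python
from pathlib import Path

# Directories copied from the "Searched for CUDA in the following directories"
# part of the log; these are examples, not a definitive list.
CANDIDATE_CUDA_DIRS = ["./cuda_sdk_lib", "/usr/local/cuda-11.2", "/usr/local/cuda", "."]


def libdevice_path(cuda_dir):
    """Return where libdevice.10.bc is expected under cuda_dir (nvvm/libdevice layout)."""
    return Path(cuda_dir) / "nvvm" / "libdevice" / "libdevice.10.bc"


def find_usable_cuda_dirs(candidates=CANDIDATE_CUDA_DIRS):
    """Return the candidate directories that actually contain the libdevice file."""
    return [d for d in candidates if libdevice_path(d).is_file()]


if __name__ == "__main__":
    usable = find_usable_cuda_dirs()
    print("CUDA dirs with libdevice:", usable if usable else "none found")
```

If the list comes back empty, none of the searched directories can satisfy the lookup, which would match the `libdevice not found` errors in the log.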
About this issue
- State: closed
- Created 2 years ago
- Reactions: 20
- Comments: 33 (2 by maintainers)
Commits related to this issue
- updated git ignore Due to tensoflow install issue: https://github.com/tensorflow/tensorflow/issues/58681 Added libdevice files to local directoy. — committed to DrakonianMight/ML_1dSpec by DrakonianMight a year ago
Hello all,
I just want to reiterate the issue and give a possibly new solution. I’m having the problem described in this thread when following the step-by-step guide, and I’ll try to lay everything down here to make it clear and reproducible.
System Info
The system does not have CUDA installed through any other means, such as apt, as the goal is to install it using conda as described in the installation guide.
Installation Procedure
Taken directly from the pip installation guide page.
When I run the verification scripts from the installation guide the GPU is detected and it runs successfully.
Output:
Output:
Issue
The issue arises when trying to train a model. The following script, taken from the overview page, has the issue.
Output:
Solutions so Far
As others have stated, going back to Tensorflow 2.10 avoids the issue. If I run the exact same installation process described above, but run `pip install tensorflow==2.10` instead, the sample training script runs on my GPU without issue. That is a viable solution for the time being, but of course not ideal.
When trying to fix the 2.11 installation by the methods described in this thread I had the same issue as @frankcaoyun. I set `XLA_FLAGS` with the following script.
This fixes the previous issue, but a new one arises when trying to run the training script.
Output:
The standout lines are `disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable` and `Couldn't invoke ptxas --version`. I tried setting `MLIR_CRASH_REPRODUCER_DIRECTORY`, and the output remains the same, just without the `disabling MLIR crash reproducer` line.
Possible Fix
The rest of the errors are all related to ptxas. I’m unsure of what the issue is exactly, as I’m not familiar with ptxas. But if I try to run `ptxas --version` then indeed the output is `Command 'ptxas' not found`. Now, that also happens in my environment with Tensorflow 2.10, but the code still runs fine on the GPU.
Regardless, I attempted to fix the issue by installing ptxas. This post mentions the conda-forge package cudatoolkit-dev, so I tried installing ptxas by running:
And sure enough, now when I run `ptxas --version` I get:
And now the training script runs on my GPU!
This is mostly just a quick, not ideal, fix. I’m not entirely sure why version 2.11 seems to need ptxas installed while version 2.10 seems to do fine without it, but apparently this fixes the issue.
I think it is important for this issue to be properly fixed, either by changing Tensorflow 2.11’s code, the conda packages, or the installation guide. Following the step-by-step guide on the website should install version 2.11 without much issue, just like it is for version 2.10.
EDIT:
Minor update. Instead of running:
I ran:
And it also fixed the issue. Now `ptxas --version` returns:
And the sample code runs on my GPU.
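For readers who want to run the same `ptxas --version` check from Python (for example inside a notebook), here is a minimal sketch using only the standard library. The function name `ptxas_version` is made up for illustration. It only reports whether a `ptxas` executable is reachable on `PATH` and does not check that TensorFlow will use it.

```python
import shutil
import subprocess


def ptxas_version(executable="ptxas"):
    """Return the output of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=False)
    # Some tools print version information to stderr, so fall back to it.
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    version = ptxas_version()
    print(version if version is not None else "ptxas not found on PATH")
```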
Hello all,
It seems like our solution made it to the official installation guide!
Now, at the bottom of the step-by-step instructions for Linux, there is the following section:
Of course, we can’t know for sure that the solution on the official page came from or was inspired by this thread, but I would certainly like to think so.
This fixed it for me:
sudo apt-get install cuda-minimal-build-11-8
It's only a 45 MB download.
My setup is an RTX 3060 on WSL2 with a freshly installed default Ubuntu 22.04.2 LTS subsystem, following the installation guide on the TensorFlow page.
Hi @SuryanarayanaY ,
There is a typo in your post. “-xla_gpu_cuda_data_dir” should be “--xla_gpu_cuda_data_dir”, with double dashes instead of a single dash. Someone copy-pasting the code will not get it to work.
Despite that, after correcting the typo, I followed the steps and encountered the errors below:
My linux machine is freshly installed with Ubuntu 22.04.1 LTS, with tensorflow=2.11.0 cudatoolkit=11.2 cudnn=8.1.0 (following the latest official installation guide)
I found out that I was having this issue because tensorflow/keras > 2.10 requires the cuda-compiler package to be installed for fitting models. Running `apt-get install cuda-compiler-11-8` creates the required libdevice directory in `${CUDA_DIR}/nvvm/libdevice`. However, you do not have to install the entire cuda-toolkit package, which is enormous.
Hi @NSalberg, @epetrovski, @chrsunwil,
Inspired by a discussion at TF-Forum, you may resolve the issue by following these steps.
1. Create an `nvvm/libdevice` folder in the Conda environment `lib` folder.
2. Copy the `libdevice.10.bc` file, which you can find elsewhere on your system, to the `nvvm/libdevice` directory.
3. Set the environment variable:
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/home/miniconda3/envs/lib
The path `/home/miniconda3/envs/lib` may be different for you. It should be the absolute path of the `lib` folder present in `miniconda3`, like `miniconda3/envs/lib` or `miniconda/lib`.
Please try this and let us know if it works.
Thank you!
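The steps above (create `nvvm/libdevice`, copy `libdevice.10.bc` into it, point `XLA_FLAGS` at the parent folder) can be sketched in Python as follows. This is an illustration under assumptions, not an official procedure: `install_libdevice` is a hypothetical helper, and both arguments are paths you have to supply for your own environment.

```python
import shutil
from pathlib import Path


def install_libdevice(libdevice_file, env_lib_dir):
    """Copy libdevice.10.bc into <env_lib_dir>/nvvm/libdevice.

    Returns the value to use for XLA_FLAGS so that the data dir is env_lib_dir.
    """
    target_dir = Path(env_lib_dir) / "nvvm" / "libdevice"
    target_dir.mkdir(parents=True, exist_ok=True)  # step 1: create the folder
    shutil.copy2(libdevice_file, target_dir / "libdevice.10.bc")  # step 2: copy the file
    # step 3: build the flag; exporting it is left to the caller
    return f"--xla_gpu_cuda_data_dir={Path(env_lib_dir).resolve()}"


# Hypothetical usage (both paths are placeholders):
# flag = install_libdevice("/some/where/libdevice.10.bc", "/home/miniconda3/envs/lib")
# then in the shell: export XLA_FLAGS=<value of flag>
```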
I started experiencing this issue in a Docker container. The Docker image had not changed (ie. my CUDA setup) but I noticed that TensorFlow’s Keras dependency was updated to v. 2.11. I’ve locked Keras to v. 2.10 and now everything works again.
Hi all, recently we noticed that the latest Ubuntu OS installs the CUDA library at `usr/lib/cuda`, but Tensorflow expects it to be at `/usr/local/cuda` as per the Conda installation instructions, which has worked so far. The command `whereis cuda` will confirm the location of the CUDA library. The workaround is to create a symlink using the command `sudo ln -s /usr/lib/cuda /usr/local/cuda`.
From the error log attached above for this ticket I observed the below log where the problem seems to be related to the CUDA path.
I am pretty confident that the symlink mentioned above will work for this case.
@frankcaoyun Can you try the proposed workaround and confirm if it works for you?
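A cautious Python equivalent of the symlink workaround might look like the sketch below. The function name is hypothetical, and unlike a plain `ln -s` it refuses to touch a destination that already exists. Linking into `/usr/local` would normally need root privileges.

```python
import os
from pathlib import Path


def link_cuda_dir(actual_dir, expected_dir):
    """Create expected_dir as a symlink pointing at actual_dir.

    Returns True if a link was created, False if expected_dir already exists.
    """
    actual, expected = Path(actual_dir), Path(expected_dir)
    if not actual.is_dir():
        raise FileNotFoundError(f"{actual} is not a directory")
    if expected.exists() or expected.is_symlink():
        return False  # leave any existing installation or link alone
    os.symlink(actual, expected, target_is_directory=True)
    return True


# Hypothetical usage, mirroring the command in the comment above:
# link_cuda_dir("/usr/lib/cuda", "/usr/local/cuda")
```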
I’m pretty sure this is due to the fact that tensorflow 2.11 requires keras >= 2.11, whereas tensorflow 2.10 requires keras >= 2.10. This issue seems to be due to keras v. 2.11.
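To see which versions are actually installed in an environment before pinning anything, a small standard-library sketch like this may help. `installed_version` is a made-up helper; the distribution names `tensorflow` and `keras` are the usual PyPI names, which is an assumption about how the packages were installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("tensorflow", "keras"):
        print(name, installed_version(name))
```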
Hi @SuryanarayanaY,
To answer your questions, I installed the driver using Ubuntu’s “Additional Drivers” utility. The CLI route would be `sudo ubuntu-drivers install 525`. I upgraded to the 525 version of the driver to see if it would fix the issue, but it remains the same.
And the output of `whereis cuda` is:
But I think your line of inquiry is misguided. My driver installation method seems to work, as I haven’t had any other issues. I can run TF 2.10 and Pytorch on the GPU with no problem, and I can successfully run `nvidia-smi` and get a proper output. The driver installation step should really only need to install the driver itself, as that is the only component that NEEDS to be outside the virtual environment. Packages such as CUDA should be handled by Conda and should lie entirely inside the virtual environment, as seems to be the intention in the installation guide, as in the step `conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0`.
The issue seems to me to be with TF 2.11 and Conda’s CUDA installation. What I think may be happening is that you may not experience the issue if your method of installing the driver also installs CUDA or ptxas in your system. The Conda environment would still be defective and incomplete, but TF would use the system-wide CUDA or ptxas package and it would work, but this kind of defeats the purpose of a virtual environment.
I also noted that your output doesn’t have the `Couldn't invoke ptxas --version` error, which indicates that your installation procedure installed ptxas properly, and the lack of this package really seems to be at the core of the issue.
Hi @SuryanarayanaY,
Regarding “latest Ubuntu OS installing CUDA library at `usr/lib/cuda`”, may I confirm which source of installation you are referring to? Is it from `sudo apt-get`, `conda` as per the tensorflow installation guide, or the `.deb` or `.run` from the nvidia cuda toolkit archive repo? This will help me test this solution, as I actually can’t find any CUDA folder under the `usr/lib` directory.
On a fresh Ubuntu 22.04 installation, if I only install the nvidia display driver, followed by installing the cuda toolkit and cuDNN as per the tensorflow installation guide, I am able to do model inferencing using the GPU, but not training. It throws the CUDA directory error that you have attached.
What works for me at this moment: I found this post worked for me consistently (and I believe it should work on). I’m able to run both inferencing and training on GPU, without the `conda` installation for `cudatoolkit` and `cudnn`.
Let me know what else I can do to help troubleshoot the issue. Thanks.
I had the same error. Setting up `XLA_FLAGS` made the error go away but did not actually fix the problem. Installing `cuda-nvcc` in the conda environment fixed it.
My setup for people coming to this thread in the future: Linux Mint, the version based on Ubuntu 22.04, Miniconda, tensorflow 2.13.1, python 3.11, CUDA 11.8, NVIDIA GeForce RTX 2060 SUPER, nvidia driver version: 535.129.03.
I am having the same issue as you are @timlac. I am on `Pop!_OS 22.04 LTS`. I hope someone figures out what is actually going on.
Edit: After doing a lot of research, I was finally able to solve my issue. I followed the instructions from the official page for the initial setup. After that it was able to detect that I had a GPU, but whenever I wanted to train a model, it would error out:
Which I fixed by following the `Ubuntu 22.04` section from the official page. However, now I can only run the models from the shell, and whenever I try to run the model (meaning a `.py` or `.ipynb` file) from PyCharm it again shows the previously mentioned errors:
1. PyCharm not detecting the GPU, which I fixed by adding an environment variable `LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/{your_user}/miniconda3/envs/tf/lib/:/home/{your_user}/miniconda3/envs/tf/lib/python3.9/site-packages/nvidia/cudnn/lib`. This fix is also mentioned in the previous posts.
2. `Couldn't invoke ptxas --version`, which I fixed by adding another environment variable in the configuration: `XLA_FLAGS=--xla_gpu_cuda_data_dir=/home/{your_username}/miniconda3/envs/tf/lib/`.
The above two environment variables fixed the issue for PyCharm. At least for me 😃
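When an IDE run configuration is awkward to edit, one alternative sketch is to set `XLA_FLAGS` from inside the script itself. The helper below is hypothetical; it keeps any other flags already present in `XLA_FLAGS` and replaces only the data-dir flag. It assumes the variable is read when XLA is first used, so the safest place to call it is before `import tensorflow`. `LD_LIBRARY_PATH` is deliberately left out, since the dynamic loader generally reads it at process start.

```python
import os


def set_xla_cuda_data_dir(cuda_data_dir, environ=os.environ):
    """Set --xla_gpu_cuda_data_dir in XLA_FLAGS, preserving other flags."""
    flag = f"--xla_gpu_cuda_data_dir={cuda_data_dir}"
    # XLA_FLAGS is a whitespace-separated list of flags; drop any older data-dir flag.
    existing = [
        f for f in environ.get("XLA_FLAGS", "").split()
        if not f.startswith("--xla_gpu_cuda_data_dir=")
    ]
    environ["XLA_FLAGS"] = " ".join(existing + [flag])
    return environ["XLA_FLAGS"]


# Hypothetical usage, placed before `import tensorflow` (the path is a placeholder):
# set_xla_cuda_data_dir("/home/{your_username}/miniconda3/envs/tf/lib/")
```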
I fixed the issue like this:
1. `find / -name "libdevice.10.bc" 2>/dev/null`, which in my case was in `/opt/env/lib/python3.10/site-packages/triton/third_party/cuda/lib`
2. `find / -name "cuda" 2>/dev/null`, e.g. in my case it's located in `/usr/local/cuda`
3. Link the `libdevice` library inside `cuda`: `ln -s /opt/env/lib/python3.10/site-packages/triton/third_party/cuda/lib /usr/local/cuda`
I am using:
*I am using singularity containers for building this with cuda 11.8
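The `find / -name ...` searches above can also be done from Python, which can be handy inside a container where shell access is limited. This is a minimal sketch with a made-up helper name; it walks the given root and yields every file with the requested name. Directories that cannot be read are skipped, which is the default behaviour of `os.walk`.

```python
import os


def find_files(root, filename="libdevice.10.bc"):
    """Yield full paths of files named `filename` anywhere under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            yield os.path.join(dirpath, filename)


if __name__ == "__main__":
    # Searching from "/" can take a while; narrow the root if you can.
    for path in find_files("/"):
        print(path)
```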
Hi @SuryanarayanaY,
Thank you very much for the proposed solution. It’s quite difficult for me to test this out since my issue occurred in a production docker setup but I hope someone else will…
However, I could not help noticing that the solution requires elements from the cuda-toolkit package. Do you know if the intention is that tf.keras will require cuda-toolkit going forward? This was not required previously and cuda-toolkit is not even included in the official nvidia cuda runtime docker image, just the cuda runtime libraries.
I can confirm I installed the cuda files through those instructions.
Anyways, I was able to get TensorFlow running with my gpu by installing it in a new conda environment using this command
albeit not the latest version.
Hi @mohantym.
I’m actually having the same (at least very similar) issue on a fresh install of `Ubuntu 22.04` running on metal.
Using `tensorflow 2.11`
I followed the guide here and can confirm that I did use `conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0`.
Tensorflow can find my gpu, but it has issues once it tries to train.
I’m using a `GTX 3060` with the 520 drivers.
I was able to fix my issue by downgrading `tensorflow` to `2.10`
Let me know if any other information could be helpful.
Hi @NSalberg !
It looks like a CUDA setup issue. I could not replicate it in a Colab environment. Could you confirm that you have installed the CUDA files (11.2/8.1) through Conda in a new environment as per this instruction, and let us know?
Thank you!