cml: cml workflow with gpu fails with LD_LIBRARY_PATH error
Using a cml GitHub workflow with docker://dvcorg/cml:0-dvc2-base1-gpu
container fails to utilise GPU due to LD_LIBRARY_PATH error:
2021-07-12 10:52:46.210323: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-07-12 10:52:47.495637: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 10:52:47.496273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-07-12 10:52:47.496789: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.496915: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.498149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-07-12 10:52:47.498508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-07-12 10:52:47.501456: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-07-12 10:52:47.501637: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.501773: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-07-12 10:52:47.501792: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (6 by maintainers)
Thank you for your measured response @0x2b3bfa0, that was quite a confusing and unhelpful message!
Ah - my mistake, actually what I did was the following, actually the reverse of what is mentioned in that comment (I missed this). https://stackoverflow.com/a/67642774
https://github.com/tensorflow/tensorflow/issues/45848#issuecomment-829557887
This appears to work, but still uncertain why the error exists at all
Perhaps relevant, but why is tensorflow 2.5.0 asking for
libcusolver.so.11
if it doesn’t exist?https://stackoverflow.com/questions/63199164/how-to-install-libcusolver-so-11
🤔 Can you please try running
sudo find / -name 'libcusolver.*' 2>/dev/null
on yourtrain
step?Apologies for delay, tf 2.2.0 works with base0 image(!), but tf 2.5.0 fails with the following:
There is actually, the above is for 2.2.0 and the 2.5.0 had LD_LIBRARY_PATH errors of a slightly different kind, let me find it
@btjones-me, can you please try using
docker://dvcorg/cml:0-dvc2-base0-gpu
instead ofdocker://dvcorg/cml:0-dvc2-base1-gpu
for yourtrain
job? As per the documentation, we providebase0
images with CUDA 10 and CuDNN 7 for compatibility with Tensorflow versions lower than 2.4.