tensorflow: OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked!
System information
- Linux Ubuntu 20.04
- TensorFlow installed from Docker tensorflow/tensorflow:2.4.0rc1
- TensorFlow version: 2.4.0rc2
- Python version: 3.6.9
- Installed using Docker
- CUDA/cuDNN version: CUDA 11.1 cuDNN v8
- GPU model and memory: RTX 3080 FE 10GB
Describe the problem
While training a custom ResNet-50 model, I get the following runtime error:
2020-11-20 12:05:01.826720: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked!
I don’t think the code has any issues. It works fine when training with CPU.
Any other info / logs
2020-11-20 12:04:55.291380: I tensorflow/core/profiler/lib/profiler_session.cc:136] Profiler session initializing.
2020-11-20 12:04:55.291414: I tensorflow/core/profiler/lib/profiler_session.cc:155] Profiler session started.
2020-11-20 12:04:55.291455: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1365] Profiler found 1 GPUs
2020-11-20 12:04:55.360280: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcupti.so.11.0
2020-11-20 12:04:55.491657: I tensorflow/core/profiler/lib/profiler_session.cc:172] Profiler session tear down.
2020-11-20 12:04:55.491780: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1487] CUPTI activity buffer flushed
2020-11-20 12:04:56.592756: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2020-11-20 12:04:56.610956: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3899970000 Hz
Epoch 1/30
2020-11-20 12:04:58.010569: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-11-20 12:04:58.802284: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2020-11-20 12:04:58.807134: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-11-20 12:05:01.826720: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked!
Traceback (most recent call last):
File "custom_resnet.py", line 131, in <module>
train_model()
File "custom_resnet.py", line 105, in train_model
callbacks=[tensorboard_callback]
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
tmp_logs = self.train_function(iterator)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
return self._stateless_fn(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError: No algorithm worked!
[[node model/conv1/Conv2D (defined at custom_resnet.py:105) ]] [Op:__inference_train_function_8452]
Function call stack:
train_function
2020-11-20 12:05:01.905250: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00 Driver Version: 455.32.00 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3080 On | 00000000:2B:00.0 On | N/A |
| 0% 43C P8 25W / 320W | 857MiB / 9995MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2020-11-20 12:10:11.234638: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-11-20 12:10:11.235502: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2020-11-20 12:10:11.269174: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.269569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:2b:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.71GHz coreCount: 68 deviceMemorySize: 9.76GiB deviceMemoryBandwidth: 707.88GiB/s
2020-11-20 12:10:11.269584: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-11-20 12:10:11.271142: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-11-20 12:10:11.271167: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2020-11-20 12:10:11.271830: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2020-11-20 12:10:11.271954: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2020-11-20 12:10:11.273538: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2020-11-20 12:10:11.273878: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2020-11-20 12:10:11.273963: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-11-20 12:10:11.274040: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.274432: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.274959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-11-20 12:10:11.274975: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-11-20 12:10:11.593266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-20 12:10:11.593303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2020-11-20 12:10:11.593309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2020-11-20 12:10:11.593483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.593857: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.594195: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.594517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 8743 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3080, pci bus id: 0000:2b:00.0, compute capability: 8.6)
True
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (10 by maintainers)
Try adding this just after importing everything:
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
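For reference, the same suggestion can be written so that it covers every visible GPU and reports a failure instead of crashing. This is only a sketch: the function name enable_gpu_memory_growth is made up for illustration, and the ImportError fallback is there purely so the snippet can run on a machine without TensorFlow. It needs to be called before any op touches the GPU, because TensorFlow refuses to change this setting once the device has been initialized.

```python
def enable_gpu_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand.

    Returns the names of the GPUs that were configured.
    (Hypothetical helper name, not part of TensorFlow.)
    """
    try:
        import tensorflow as tf
    except ImportError:
        # Assumption: fall back quietly so the sketch runs without TensorFlow.
        return []

    configured = []
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured.append(gpu.name)
        except RuntimeError as err:
            # Raised when the GPU has already been initialized.
            print("Could not enable memory growth for", gpu.name, ":", err)
    return configured


if __name__ == "__main__":
    print("Memory growth enabled on:", enable_gpu_memory_growth())
```

Call it at the top of the training script, right after the imports and before the model is built.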
I experienced this issue on an MSI GL65 with an RTX2070 on Ubuntu 20.04.
Dynamic libraries are the following:
Adding the lines indicated by @king398 solved my issue.
Same issue with nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04 and an RTX 3080.
Using CUDA 11.1 causes:
Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
Tried with rc0 through rc4.
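One quick way to see which library names the dynamic loader can actually resolve inside the container is to try opening them from Python with ctypes. This is a rough diagnostic sketch, not something from the thread: the helper name find_loadable is hypothetical, and the two library names are simply the ones mentioned in this issue.

```python
import ctypes


def find_loadable(names):
    """Return {library name: True/False} depending on whether it can be loaded."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the normal loader search path, incl. LD_LIBRARY_PATH
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    status = find_loadable(["libcusolver.so.10", "libcusolver.so.11"])
    for name, ok in status.items():
        print(name, "found" if ok else "missing")
```

If only libcusolver.so.11 is reported as found, the installed CUDA toolkit is newer than the one this TensorFlow build looks for.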
Edit: Fixed.
Docker image: nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
TF version: tf-nightly-gpu
LD_LIBRARY_PATH needs to be changed for the symlink to work:
ENV LD_LIBRARY_PATH=/usr/local/cuda-11.1/targets/x86_64-linux/lib
Make a symlink so that libcusolver.so.10 is defined:
RUN ln -s /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcusolver.so.11 /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcusolver.so.10
If you get a cuBLAS error, you can try this:
TF 2.4 is built & tested against CUDA 11.0, not 11.1.
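To check which CUDA and cuDNN versions a given TensorFlow install was built against, something like the following should work. It is a sketch that assumes a TensorFlow version providing tf.sysconfig.get_build_info() (2.3 or later, as far as I know); the helper name built_against is hypothetical, and the keys may be absent on CPU-only builds, in which case the values come back as None.

```python
def built_against():
    """Return the CUDA/cuDNN versions recorded in TensorFlow's build info.

    (Hypothetical helper name, not part of TensorFlow.)
    """
    try:
        import tensorflow as tf
    except ImportError:
        # Assumption: return nothing so the sketch runs without TensorFlow.
        return {}

    info = tf.sysconfig.get_build_info()
    # .get() is used because CPU-only builds may not carry these keys.
    return {key: info.get(key) for key in ("cuda_version", "cudnn_version")}


if __name__ == "__main__":
    print(built_against())
```

Comparing this output with the CUDA version installed in the container shows whether the two match.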
I’ve found a temporary solution by using the software provided by Lambda Stack. It works on Ubuntu 20.04 for all RTX 30 series GPUs.