tensorflow: Eager Execution error: Blas GEMM launch failed

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code: no
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): pip3 install tensorflow-gpu
  • TensorFlow version (use command below): v1.12.0-0-ga6d8ffae09 1.12.0
  • Python version: 3.5.2
  • CUDA/cuDNN version: CUDA 9.0, cudnn 7.4.2
  • GPU model and memory: GeForce RTX 2080 Ti

Describe the current behavior: Crashes with the error “Blas GEMM launch failed”

Describe the expected behavior: Correctly prints the matmul result

Code to reproduce the issue: I was trying to use eager execution. I tried the following simple code:

import tensorflow as tf
tf.enable_eager_execution()
print(tf.matmul([[1., 2.],[3., 4.]], [[1., 2.],[3., 4.]]))

Other eager-mode examples at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples fail with the same error.

However, non-eager (graph-mode) code runs correctly.
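For reference, the non-eager equivalent of the snippet above runs without the error (a minimal sketch using the standard TF 1.x graph/session API):

import tensorflow as tf

# Build the same matmul as a graph op and run it in a session (no eager execution).
c = tf.matmul([[1., 2.], [3., 4.]], [[1., 2.], [3., 4.]])
with tf.Session() as sess:
    print(sess.run(c))  # [[ 7. 10.] [15. 22.]]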

Other info / logs: output below

2019-01-31 17:00:20.744826: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-01-31 17:00:21.150735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:17:00.0
totalMemory: 10.73GiB freeMemory: 9.36GiB
2019-01-31 17:00:21.399702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:65:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-01-31 17:00:21.399746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-01-31 17:00:21.906842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-31 17:00:21.906877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1
2019-01-31 17:00:21.906882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y
2019-01-31 17:00:21.906886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N
2019-01-31 17:00:21.907143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9026 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5)
2019-01-31 17:00:21.907488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10167 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-31 17:00:22.144957: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "test.py", line 5, in <module>
    print(tf.matmul([[1., 2.],[3., 4.]], [[1., 2.],[3., 4.]]))
  File "/home/weixu/.local/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 2057, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/weixu/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4586, in mat_mul
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(2, 2), b.shape=(2, 2), m=2, n=2, k=2 [Op:MatMul] name: MatMul/

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 44 (7 by maintainers)

Most upvoted comments

Can you please kill all running notebooks that are using your GPU, then restart the kernel and execute the code again?

I’ve just copied cublas64_10.dll to cublas64_100.dll, and it worked 😃
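For anyone on Windows who wants to script that workaround, a minimal sketch (the CUDA install path below is an assumption, adjust it to your machine, and copying into Program Files usually needs admin rights):

import shutil

# Copy the CUDA 10.1 cuBLAS DLL to the name a CUDA 10.0 TensorFlow build looks for.
cuda_bin = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin"
shutil.copyfile(cuda_bin + r"\cublas64_10.dll", cuda_bin + r"\cublas64_100.dll")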

The following is what I found:

  • If you have an RTX 20 series card (e.g. 2070 or 2080), you need to use CUDA 10.0 with cuDNN 7.5 and TF > 1.12.
  • If you use CUDA 10.1, you must use cuDNN 7.6.
  • If you have an RTX 20 series card and still want to use CUDA 9.0 and cuDNN 7.5, you must use TF <= 1.12.

Also, if you have an RTX 20 series card and CUDA 10, you must put this in your code (how to apply it is sketched after the snippet):

tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf_config.gpu_options.per_process_gpu_memory_fraction = 0.9
tf_config.allow_soft_placement = True
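
Note that the ConfigProto only takes effect once it is handed to the runtime; a minimal sketch using the standard TF 1.x entry points (not part of the original comment):

import tensorflow as tf

tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf_config.gpu_options.per_process_gpu_memory_fraction = 0.9
tf_config.allow_soft_placement = True

# Graph mode: pass the config when creating the session.
# sess = tf.Session(config=tf_config)

# Eager mode (TF 1.x), as in this issue: pass it when enabling eager execution.
tf.enable_eager_execution(config=tf_config)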

@Edremelech Thank you! After renaming cublas64_10.dll to cublas64_100.dll, my program runs as well.

Issue solved with tf-nightly-gpu and CUDA 10

If running an RTX-series card, first check that TF2 is using CUDA 10 (a version-check sketch follows the snippet below), then call set_memory_growth and done!

physical_devices = tf.config.list_physical_devices('GPU')
try:
  # Allocate GPU memory on demand instead of grabbing almost all of it up front.
  tf.config.experimental.set_memory_growth(physical_devices[0], True)
  assert tf.config.experimental.get_memory_growth(physical_devices[0])
except (RuntimeError, IndexError):
  # Memory growth must be set before the GPU is initialized, and there may be no GPU at all.
  pass
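
And a quick way to check which CUDA build the installed TensorFlow actually uses (a sketch; tf.sysconfig.get_build_info() only exists in newer TF 2.x releases, hence the guard):

import tensorflow as tf

print(tf.__version__)                # installed TensorFlow release
print(tf.test.is_built_with_cuda())  # True if this wheel was built against CUDA
try:
  # Reports the CUDA/cuDNN versions the wheel was compiled with.
  info = tf.sysconfig.get_build_info()
  print(info.get('cuda_version'), info.get('cudnn_version'))
except AttributeError:
  pass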

I’ve just copied cublas64_10.dll to cublas64_100.dll, and it worked 😃

Thank you for your reply. Where do you find cublas64_10.dll and cublas64_100.dll? I cannot find them.

My problem is solved. TensorFlow failed to load cublas64_100.dll because it was called cublas64_10.dll. I am simply shocked to encounter such errors. Anyway, thanks everybody, and I hope my stupid messages will help another newb who can’t believe that DLL names change 😃

I am almost a hundred percent sure the GPU is not actually running out of memory. It doesn’t even run for one epoch.

The same code, same dataset, same everything, runs fine on a 1080 Ti, the GPUs on Colab, a 1080, and a 1070.

My feeling tells me something is wrong with CUDA 9.0 on the 2080 Ti, or with eager execution on the 2080 Ti with CUDA 9.0.