DeepSpeech: "UnknownError: Failed to get convolution algorithm" from ./bin/run-ldc93s1.sh

Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No custom code
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): PopOS (Ubuntu derivative) 18.10
TensorFlow installed from (our builds, or upstream TensorFlow): pip commands from DeepSpeech Readme: pip3 uninstall tensorflow && pip3 install 'tensorflow-gpu==1.13.1'
TensorFlow version (use command below): tensorflow-gpu 1.13.1
Python version: 3.7.1
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: cuda 10.0 / cudnn 7.5 (as specified in DeepSpeech readme)
GPU model and memory: GeForce RTX 2060, 5904MiB
Exact command to reproduce: ./bin/run-ldc93s1.sh (from DeepSpeech readme)

First off, thanks to everyone working on DeepSpeech for a really awesome open source package.

I have the same issue as described here: https://github.com/mozilla/DeepSpeech/issues/2119 . That issue was closed with the instruction “please stick to Tensorflow recommended versions,” although the user stated they were using TensorFlow 1.13.1, which is (currently, at least) the TF version specified in the DeepSpeech Readme. This looks like a bug to me: another user and I both get the same error when running a DeepSpeech-provided bin script for retraining the model, after installing DeepSpeech with the specified versions of TF/cuda/cudnn.

I have followed the DeepSpeech Readme installation instructions carefully and installed all requirements, including the correct versions of cuda/cudnn. I can successfully run the DeepSpeech command for voice-to-text inference with the downloaded pretrained model, but when retraining that model (using the Readme’s “Training a Model” script: ./bin/run-ldc93s1.sh ), I get the following error:

tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. 
This is probably because cuDNN failed to initialize, so try looking to see if a warning log 
message was printed above.
	 [[{{node tower_0/conv1d/Conv2D}}]]

Full log / stacktrace:

(dsenv) mepstein@pop-os:~/DeepSpeech$ ./bin/run-ldc93s1.sh
+ [ ! -f DeepSpeech.py ]
+ [ ! -f data/ldc93s1/ldc93s1.csv ]
+ [ -d  ]
+ python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path("deepspeech/ldc93s1"))
+ checkpoint_dir=/home/mepstein/.local/share/deepspeech/ldc93s1
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --noshow_progressbar --train_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --checkpoint_dir /home/mepstein/.local/share/deepspeech/ldc93s1
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
I Initializing variables...
I STARTING Optimization
I Training epoch 0...
Traceback (most recent call last):
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node tower_0/conv1d/Conv2D}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 829, in <module>
    tf.app.run(main)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 813, in main
    train()
  File "DeepSpeech.py", line 510, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "DeepSpeech.py", line 483, in run_set
    feed_dict=feed_dict)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

Caused by op 'tower_0/conv1d/Conv2D', defined at:
  File "DeepSpeech.py", line 829, in <module>
    tf.app.run(main)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 813, in main
    train()
  File "DeepSpeech.py", line 400, in train
    gradients, loss = get_tower_results(iterator, optimizer, dropout_rates)
  File "DeepSpeech.py", line 253, in get_tower_results
    avg_loss = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "DeepSpeech.py", line 186, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse)
  File "DeepSpeech.py", line 119, in create_model
    batch_x = create_overlapping_windows(batch_x)
  File "DeepSpeech.py", line 56, in create_overlapping_windows
    batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 3482, in conv1d
    data_format=data_format)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

About this issue

State: closed
Created 5 years ago
Comments: 40 (5 by maintainers)

Commits related to this issue

Document TF_FORCE_GPU_ALLOW_GROWTH Fixes #2211 — committed to lissyx/STT by deleted user 5 years ago
Document TF_FORCE_GPU_ALLOW_GROWTH Fixes #2211 — committed to reuben/STT by deleted user 5 years ago
Document TF_FORCE_GPU_ALLOW_GROWTH Fixes #2211 — committed to rcgale/DeepSpeech by deleted user 5 years ago

Most upvoted comments

Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating allow_growth as an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py

…and now ./bin/run-ldc93s1.sh trains without errors.

MaxPowerWasTaken on Jun 26, 2019
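
For readers who want to try the workaround from the comment above, here is a minimal sketch of what the top of DeepSpeech.py might look like. The helper name force_gpu_allow_growth is made up for illustration and is not part of DeepSpeech; only the TF_FORCE_GPU_ALLOW_GROWTH variable and its 'true' value come from the thread. The sketch assumes the variable has to be in place before TensorFlow initializes the GPU, so it is set before the TensorFlow import to be safe.

```python
import os


def force_gpu_allow_growth(environ=os.environ):
    """Ask TensorFlow to grow GPU memory on demand (hypothetical helper).

    setdefault keeps any value that was already exported in the shell,
    so a user can still override the setting from outside the script.
    """
    environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")
    return environ["TF_FORCE_GPU_ALLOW_GROWTH"]


# Call this at the very top of the script, before TensorFlow is imported.
force_gpu_allow_growth()

# import tensorflow as tf  # the rest of DeepSpeech.py would follow here
```

Exporting the variable in the shell before calling ./bin/run-ldc93s1.sh should have the same effect, since the script and the Python process inherit the environment.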

Thank you! I’ve been dealing with this problem for a LONG time and this finally solved it.

werneric on Jun 26, 2019

Thanks!

BenXQ on Jun 26, 2019

So that would confirm it’s this allow_growth option, and that just your way of setting it was wrong. I’d like to understand better what that option does, and whether we should use it.

lissyx on Jun 25, 2019
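
On the question of what allow_growth does: as far as the TensorFlow 1.x documentation describes it, the option makes TensorFlow allocate GPU memory on demand instead of reserving nearly all of it when the session starts. A plausible explanation for the error, not confirmed anywhere in this thread, is that the up-front reservation left cuDNN too little memory to initialize. The sketch below shows the session-config form of the same setting. The function name session_config_with_growth is hypothetical, and the tf module is passed in as an argument so the snippet does not assume how DeepSpeech.py builds its own session.

```python
def session_config_with_growth(tf):
    """Return a TF 1.x ConfigProto with on-demand GPU memory allocation.

    `tf` is the imported tensorflow module (TF 1.x API assumed).
    """
    config = tf.ConfigProto()
    # Grow the GPU memory pool as needed rather than grabbing it all at start.
    config.gpu_options.allow_growth = True
    return config


# Hypothetical usage with TensorFlow 1.13 installed:
#   import tensorflow as tf
#   session = tf.Session(config=session_config_with_growth(tf))
```

The environment variable from the accepted workaround has the advantage that it needs no change to how the session is constructed, which may be why it worked where setting the option another way did not.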