DeepSpeech: "UnknownError: Failed to get convolution algorithm" from ./bin/run-ldc93s1.sh
- Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No custom code
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): PopOS (Ubuntu derivative) 18.10
- TensorFlow installed from (our builds, or upstream TensorFlow): pip commands from DeepSpeech Readme: pip3 uninstall tensorflow && pip3 install 'tensorflow-gpu==1.13.1'
- TensorFlow version (use command below): tensorflow-gpu 1.13.1
- Python version: 3.7.1
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: cuda 10.0 / cudnn 7.5 (as specified in DeepSpeech readme)
- GPU model and memory: GeForce RTX 2060, 5904MiB
- Exact command to reproduce: ./bin/run-ldc93s1.sh (from DeepSpeech readme)
First off thanks to everyone working on DeepSpeech for a really awesome open source package.
I have the same issue as described here: https://github.com/mozilla/DeepSpeech/issues/2119 . That issue was closed with the instruction “please stick to Tensorflow recommended versions,” although the user specified they used TensorFlow 1.13.1, which is (currently at least) the TF version specified in the DeepSpeech Readme. This looks like a bug to me because another user and I are both getting the same error when running a DeepSpeech-provided bin script for retraining the model, after installing DeepSpeech with the specified versions of TF/cuda/cudnn.
I have followed the DeepSpeech readme installation instructions carefully and have installed all requirements, including the correct versions of cuda/cudnn. I can run the DeepSpeech command to do voice-to-text inference successfully using the downloaded pretrained model, but when retraining that model (using the DeepSpeech Readme’s “Training a Model” script: ./bin/run-ldc93s1.sh), I get the following error:
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm.
This is probably because cuDNN failed to initialize, so try looking to see if a warning log
message was printed above.
[[{{node tower_0/conv1d/Conv2D}}]]
Full log / stacktrace:
(dsenv) mepstein@pop-os:~/DeepSpeech$ ./bin/run-ldc93s1.sh
+ [ ! -f DeepSpeech.py ]
+ [ ! -f data/ldc93s1/ldc93s1.csv ]
+ [ -d ]
+ python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path("deepspeech/ldc93s1"))
+ checkpoint_dir=/home/mepstein/.local/share/deepspeech/ldc93s1
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --noshow_progressbar --train_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --checkpoint_dir /home/mepstein/.local/share/deepspeech/ldc93s1
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
tf.py_function, which takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means `tf.py_function`s can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
I Initializing variables...
I STARTING Optimization
I Training epoch 0...
Traceback (most recent call last):
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d/Conv2D}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 829, in <module>
tf.app.run(main)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 813, in main
train()
File "DeepSpeech.py", line 510, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "DeepSpeech.py", line 483, in run_set
feed_dict=feed_dict)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]
Caused by op 'tower_0/conv1d/Conv2D', defined at:
File "DeepSpeech.py", line 829, in <module>
tf.app.run(main)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 813, in main
train()
File "DeepSpeech.py", line 400, in train
gradients, loss = get_tower_results(iterator, optimizer, dropout_rates)
File "DeepSpeech.py", line 253, in get_tower_results
avg_loss = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File "DeepSpeech.py", line 186, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse)
File "DeepSpeech.py", line 119, in create_model
batch_x = create_overlapping_windows(batch_x)
File "DeepSpeech.py", line 56, in create_overlapping_windows
batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 3482, in conv1d
data_format=data_format)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/mepstein/DeepSpeech/dsenv/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]
About this issue
- State: closed
- Created 5 years ago
- Comments: 40 (5 by maintainers)
Commits related to this issue
- Document TF_FORCE_GPU_ALLOW_GROWTH Fixes #2211 — committed to lissyx/STT by deleted user 5 years ago
- Document TF_FORCE_GPU_ALLOW_GROWTH Fixes #2211 — committed to reuben/STT by deleted user 5 years ago
- Document TF_FORCE_GPU_ALLOW_GROWTH Fixes #2211 — committed to rcgale/DeepSpeech by deleted user 5 years ago
Ok finally got it to work (still with CUDA_VISIBLE_DEVICES=0) by updating allow_growth as an environment variable. I added: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to top of DeepSpeech.py… and now ./bin/run-ldc93s1.sh trains without errors.
Thank you! I’ve been dealing with this problem for a LONG time and this finally solved it.
Thanks !
So it would confirm it’s this allow_growth, and just that your way of setting it was wrong. I’d like to understand better what that option does, and if we should use it.
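For anyone landing here later, a minimal sketch of the workaround described above. By default TensorFlow reserves nearly all GPU memory when it starts; with allow_growth it allocates memory as needed instead, which appears to leave room for cuDNN to initialize in cases like this one. The helper name force_gpu_allow_growth is made up for illustration and is not part of DeepSpeech or TensorFlow; only the TF_FORCE_GPU_ALLOW_GROWTH variable comes from this thread. The assumption is that the variable has to be set before TensorFlow initializes the GPU, so it goes above the tensorflow import.

```python
import os


def force_gpu_allow_growth(env=os.environ):
    """Set TF_FORCE_GPU_ALLOW_GROWTH so TensorFlow grows GPU memory on demand.

    Hypothetical helper for illustration: `env` defaults to the real process
    environment, but any dict-like mapping can be passed instead.
    """
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env["TF_FORCE_GPU_ALLOW_GROWTH"]


# Call this at the very top of the script, before TensorFlow is imported.
force_gpu_allow_growth()

# import tensorflow as tf  # assumption: import only after the variable is set
```

Setting the variable in the shell before launching ./bin/run-ldc93s1.sh should have the same effect, since the script's child Python process inherits the environment.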