DeepSpeech: Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 31 , 12, 2048]

If you’ve found a bug, or have a feature request, then please create an issue with the following information:

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • TensorFlow installed from (our builds, or upstream TensorFlow): pip
  • TensorFlow version (use command below): 1.15
  • Python version: 3.5
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.0
  • GPU model and memory: 4 gtx 1080 Ti
  • Exact command to reproduce:
andre@andrednn:~/projects/DeepSpeech$ more .compute_msprompts
#!/bin/bash

set -xe

#apt-get install -y python3-venv libopus0

#python3 -m venv /tmp/venv
#source /tmp/venv/bin/activate

#pip install -U setuptools wheel pip
#pip install .
#pip uninstall -y tensorflow
#pip install tensorflow-gpu==1.14

#mkdir -p ../keep/summaries

data="${SHARED_DIR}/data"
fis="${data}/LDC/fisher"
swb="${data}/LDC/LDC97S62/swb"
lbs="${data}/OpenSLR/LibriSpeech/librivox"
cv="${data}/mozilla/CommonVoice/en_1087h_2019-06-12/clips"
npr="${data}/NPR/WAMU/sets/v0.3"

python -u DeepSpeech.py \
  --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv \
  --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv \
  --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv \
  --train_batch_size 12 \
  --dev_batch_size 24 \
  --test_batch_size 24 \
  --scorer ~/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer \
  --alphabet_config_path ~/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt \
  --train_cudnn \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.40 \
  --epochs 150 \
  --noearly_stop \
  --audio_sample_rate 8000 \
  --save_checkpoint_dir ~/projects/corpora/deepspeech-fulltrain-ptbr  \
  --use_allow_growth \
  --log_level 0

I’m getting the following error when training on my pt-BR 8 kHz dataset. I have tried downgrading and upgrading CUDA, cuDNN, the NVIDIA drivers, and Ubuntu (16 and 18), and the error persists. I have tried datasets with two different clip lengths, 6 s and 15 s. Both contain 8 kHz audio.
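
To rule out stray clips, the sample rate and duration of every file listed in a training CSV can be checked with the standard library. This is only a sketch: it assumes the CSV has a `wav_filename` column holding paths that can be opened as-is, and the names `wav_info` and `find_unexpected_clips` are made up for illustration.

```python
import csv
import wave


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames / float(rate)


def find_unexpected_clips(csv_path, expected_rate=8000, max_seconds=15.0):
    """List clips whose rate differs from expected_rate or that exceed max_seconds.

    Assumption: the CSV has a `wav_filename` column with openable paths.
    """
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            rate, seconds = wav_info(row["wav_filename"])
            if rate != expected_rate or seconds > max_seconds:
                problems.append((row["wav_filename"], rate, round(seconds, 2)))
    return problems
```

An empty list from `find_unexpected_clips(...)` would suggest the audio itself matches `--audio_sample_rate 8000` and the expected clip length, which narrows the search to the training stack.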

andre@andrednn:~/projects/DeepSpeech$ bash .compute_msprompts
+ data=/data
+ fis=/data/LDC/fisher
+ swb=/data/LDC/LDC97S62/swb
+ lbs=/data/OpenSLR/LibriSpeech/librivox
+ cv=/data/mozilla/CommonVoice/en_1087h_2019-06-12/clips
+ npr=/data/NPR/WAMU/sets/v0.3
+ python -u DeepSpeech.py --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv --train_batch_size 12 --dev_batch_size 24 --test_batch_size 24 --scorer /home/andre/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer --alphabet_config_path /home/andre/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt --train_cudnn --n_hidden 2048 --learning_rate 0.0001 --dropout_rate 0.40 --epochs 150 --noearly_stop --audio_sample_rate 8000 --save_checkpoint_dir /home/andre/projects/corpora/deepspeech-fulltrain-ptbr --use_allow_growth --log_level 0
2020-06-18 12:30:07.508455: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-18 12:30:07.531012: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3597670000 Hz
2020-06-18 12:30:07.531588: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5178d70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-18 12:30:07.531608: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-18 12:30:07.533960: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-18 12:30:09.563468: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5416390 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-18 12:30:09.563492: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-18 12:30:09.563497: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-18 12:30:09.563501: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-18 12:30:09.563505: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-18 12:30:09.570577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
2020-06-18 12:30:09.571728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
2020-06-18 12:30:09.572862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
2020-06-18 12:30:09.573993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2020-06-18 12:30:09.574226: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:09.575280: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-18 12:30:09.576167: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-18 12:30:09.576401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-18 12:30:09.577541: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-18 12:30:09.578426: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-18 12:30:09.581112: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-18 12:30:09.589736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
2020-06-18 12:30:09.589770: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:09.594742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-18 12:30:09.594757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1 2 3
2020-06-18 12:30:09.594763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N Y Y Y
2020-06-18 12:30:09.594767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   Y N Y Y
2020-06-18 12:30:09.594770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2:   Y Y N Y
2020-06-18 12:30:09.594774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3:   Y Y Y N
2020-06-18 12:30:09.600428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-18 12:30:09.602038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2020-06-18 12:30:09.603572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2020-06-18 12:30:09.605112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
swig/python detected a memory leak of type 'Alphabet *', no destructor found.
W WARNING: You specified different values for --load_checkpoint_dir and --save_checkpoint_dir, but you are running training and testing in a single invocation. The testing step will respect --load_checkpoint_dir, and thus WILL NOT TEST THE CHECKPOINT CREATED BY THE TRAINING STEP. Train and test in two separate invocations, specifying the correct --load_checkpoint_dir in both cases, or use the same location for loading and saving.
2020-06-18 12:30:10.102127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
2020-06-18 12:30:10.103272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
2020-06-18 12:30:10.104379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
2020-06-18 12:30:10.105484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2020-06-18 12:30:10.105521: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:10.105533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-18 12:30:10.105562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-18 12:30:10.105574: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-18 12:30:10.105586: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-18 12:30:10.105597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-18 12:30:10.105610: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-18 12:30:10.114060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
W0618 12:30:10.218584 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
W0618 12:30:10.218781 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_classes(iterator)`.
W0618 12:30:10.218892 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_classes(iterator)`.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0618 12:30:10.324707 139639980619584 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0618 12:30:10.326584 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0618 12:30:10.401312 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0618 12:30:11.297271 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2020-06-18 12:30:11.458650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
2020-06-18 12:30:11.459790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
2020-06-18 12:30:11.460897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
2020-06-18 12:30:11.462003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2020-06-18 12:30:11.462041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:11.462071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-18 12:30:11.462085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-18 12:30:11.462097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-18 12:30:11.462109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-18 12:30:11.462121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-18 12:30:11.462133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-18 12:30:11.470539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
2020-06-18 12:30:11.470679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-18 12:30:11.470694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1 2 3
2020-06-18 12:30:11.470699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N Y Y Y
2020-06-18 12:30:11.470703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   Y N Y Y
2020-06-18 12:30:11.470707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2:   Y Y N Y
2020-06-18 12:30:11.470710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3:   Y Y Y N
2020-06-18 12:30:11.476196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-18 12:30:11.477355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2020-06-18 12:30:11.478490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2020-06-18 12:30:11.479608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
D Session opened.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
2020-06-18 12:30:12.233482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-06-18 12:30:14.672316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Epoch 0 |   Training | Elapsed Time: 0:00:16 | Steps: 33 | Loss: 18.239303
2020-06-18 12:30:30.589204: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-06-18 12:30:30.589243: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
Traceback (most recent call last):
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
         [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
         [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
         [[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
    absl.app.run(main)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
    train()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 608, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 568, in run_set
    feed_dict=feed_dict)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
         [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
         [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.

Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3_1':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
    absl.app.run(main)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
    train()

  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 487, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 313, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 191, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn
    sequence_lengths=seq_length)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
    training)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
    seed=self._seed)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
    outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
    time_major=time_major, name=name)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 155 (9 by maintainers)

Most upvoted comments

Still not working for me with up to date master and newly created docker container. But as mentioned somewhere above, running export TF_CUDNN_RESET_RND_GEN_STATE=1 solved my problem.

I confirm that the flag addressed my issues, and that I managed to train and get a fully functioning model.
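For anyone launching training from a Python wrapper rather than a shell, a minimal sketch of the same workaround is below. It assumes that setting the variable in the process environment before TensorFlow is imported has the same effect as the shell export; that assumption has not been verified in this thread.

```python
import os

# Equivalent of `export TF_CUDNN_RESET_RND_GEN_STATE=1` for the current process.
# Assumption: it has to be in the environment before TensorFlow sets up cuDNN,
# so set it before importing tensorflow or the training module.
os.environ["TF_CUDNN_RESET_RND_GEN_STATE"] = "1"

# import tensorflow as tf  # import only after the variable has been set
```

Child processes started from this script inherit the variable, so a training run launched through subprocess would see it as well.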

@lissyx @reuben

OK I have done some more runs:

I ran train_debug_As_Bs_Cs.csv with batch sizes 1 and 2:

Batch size 1 trains fine. Batch size 2 blows up on the step with files: B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav

So I made some new csv files with:

batch A: two files from the original batch A
batch B: two files B/98_2923 and B/154_4738 from batch B
batch C: two files from the original batch C

And I made some variant of that:

train_debug_mini_As_Bs_Cs.csv
train_debug_mini_Bs_As_Cs.csv
train_debug_mini_Bs_As_Cs_B_swapped.csv
train_debug_mini_As_Bs_Cs_B_swapped.csv
train_debug_mini_As_Bs_Cs_B_mixed_A.csv
train_debug_mini_As_Bs_Cs_B_mixed_C.csv
train_debug_mini_As_Bs_Cs_B_mixed_C_2.csv
train_debug_mini_As_Bs_Cs_B_swapped_C_mixed.csv

The results of that:

With batch size 1, these all work out fine (as expected).
With batch size 2:
train_debug_mini_As_Bs_Cs.csv
    blows up in step 1, which is batch B.

train_debug_mini_As_Bs_Cs_B_swapped.csv
    blows up in step 1, which is batch B, so swapping the order within B doesn't make a difference.

train_debug_mini_Bs_As_Cs.csv
    works fine, B is the first step 0.
    as expected as the first step seems to be a special case.

train_debug_mini_Bs_As_Cs_B_swapped.csv
    works fine, B is the first step 0, so swapping the order in B doesn't make a difference.
    as expected as the first step seems to be a special case.

train_debug_mini_As_Bs_Cs_B_mixed_A.csv
    blows up in step 1, which is:
        A/155_4757
        B/154_4738

train_debug_mini_As_Bs_Cs_B_mixed_C.csv
    blows up in step 1, which is:
        B/98_2923
        C/169_5271

train_debug_mini_As_Bs_Cs_B_mixed_C_2.csv
    blows up in step 1, which is:
        C/169_5271
        B/98_2923

train_debug_mini_As_Bs_Cs_B_swapped_C_mixed.csv
    blows up in step 2, which is:
        B/98_2923
        C/169_5271

    while it did complete step 1, which is:
        B/154_4738
        C/175_5429

My interpretation of this all:

  • batch size 1 always works, so it is not completely file specific
  • with batch size 2 both B/98_2923 and B/154_4738 appear in blowups.
  • with batch size 2 B/154_4738 appears in both a blowup and a succeeded step.
  • from the previous experiments we know that when you mix batch B into a much larger pool of (more different) files, all works out well.

So it is a bit odd, I’m starting to wonder if this is some edge case where we hit some math operation blowing up. But both files from B have slightly different file sizes and both blow up in combinations with other files with slightly different file sizes (from A and C).

So I’m a bit lost now. You have more insight into how things get processed, so hopefully you have some more ideas based on that.

CSVs and logs are attached (sample files from the previous post can be used): train_debug_mini.tar.gz
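For anyone who wants to build similar mini CSV variants, here is a rough sketch of how the reordering could be scripted. The file names, sizes and transcripts are placeholders, and write_variant is a hypothetical helper, not part of DeepSpeech; only the three column names follow the usual DeepSpeech CSV layout. It also assumes the training run consumes rows in file order, which may not hold if the pipeline reorders samples.

```python
import csv
import io

# Column layout of DeepSpeech training CSVs.
FIELDS = ["wav_filename", "wav_filesize", "transcript"]


def write_variant(out, batches, order):
    """Write the rows of each named batch to `out`, in the given batch order."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for name in order:
        for row in batches[name]:
            writer.writerow(row)


# Placeholder rows: two files per batch, values are made up for illustration.
batches = {
    "A": [
        {"wav_filename": "A/first.wav", "wav_filesize": 1000, "transcript": "a one"},
        {"wav_filename": "A/second.wav", "wav_filesize": 1010, "transcript": "a two"},
    ],
    "B": [
        {"wav_filename": "B/first.wav", "wav_filesize": 2000, "transcript": "b one"},
        {"wav_filename": "B/second.wav", "wav_filesize": 2010, "transcript": "b two"},
    ],
    "C": [
        {"wav_filename": "C/first.wav", "wav_filesize": 3000, "transcript": "c one"},
        {"wav_filename": "C/second.wav", "wav_filesize": 3010, "transcript": "c two"},
    ],
}

# Build a "Bs_As_Cs" style variant in memory; use open(path, "w", newline="")
# instead of StringIO to write a real file.
buf = io.StringIO()
write_variant(buf, batches, ["B", "A", "C"])
variant_lines = buf.getvalue().splitlines()
```

Swapping the order list, or the order of rows inside one batch, gives the other variants without editing files by hand.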

I also witnessed this, and I found it is related to the memory usage of Python. I have a 12 GB graphics card, and if Python uses more than 12 GB, the error occurs. You can see the memory usage of Python in the Windows Task Manager. So I reduced my batch size to reduce memory usage, and used TF_CUDNN_RESET_RND_GEN_STATE=1 to solve the problem. I hope this helps to figure out the problem.

Reducing the batch size from 64 to 32 for training and from 32 to 16 for test and dev data solved this issue.

First would be to check if a custom build TF14 doesn’t have the problem (with the 7.4.1.5 cudnn and/or the newest). If so it would point to a change in TF, if not … nah don’t think about that yet …

yeah that’s what I’m doing …

And at least I repro with this build as well.

After a lot of hacking, I’ve been able to rebuild locally outside of their docker (easier for playing with gdb), building and running against a pyenv-built python, and that build reproduces the issue, so I’m preparing a debug build.

Updated the table above, I think I’m convinced enough to say that the TF14 image doesn’t have the problem. Hope you succeed in pinning it to a particular cudnn version.

I just verified and I repro with cudnn v7.6.1 as well. I think I should try and rebuild tf 1.15.2 docker with cudnn 7.6, 7.5 and 7.4 to assert here.

I ran the test with different drivers, preliminary results (will do a long test after this):

| Nvidia host driver | docker base image | short tests | long test |
|---|---|---|---|
| 440.100 | tensorflow/tensorflow:1.14.0-gpu-py3 | worked | |
| 440.100 | tensorflow/tensorflow:1.15.2-gpu-py3 | failed | |
| 430.64 | tensorflow/tensorflow:1.14.0-gpu-py3 | worked | |
| 430.64 | tensorflow/tensorflow:1.15.2-gpu-py3 | failed | |
| 450.57 | tensorflow/tensorflow:1.14.0-gpu-py3 | worked | worked |
| 450.57 | tensorflow/tensorflow:1.15.2-gpu-py3 | failed | failed |

440.100 was the driver I was using originally. 430.64 is the driver downloadable just below the 431.36 that was reported as working on the TF forum (the versioning of Nvidia could be different, so it may not actually be below 431.36, but it was my best guess). 450.57 is the latest stable driver, released yesterday.

So from this I would take that the host driver version doesn’t matter. And I haven’t been able to prove that the TF14 image doesn’t work 😃

Will start a long test now with the TF14 image.

There are hints in some of the reports that it might be related to the ordering of sequence_length. I’d like to get a better grasp of that and confirm it, so maybe we could at least have some tooling / workaround to help with that.

@applied-machinelearning For fun, at some point, some combination of dataset, driver and tensorflow version on our codebase would trigger a power surge on my hardware at home, and it was too much for my PSU that was shutting down 😕

Another question is, I saw the inference side of DeepSpeech seems to work now on tensorflow 2.x, how much work would the training side be ?

Lots

That is unfortunate.

Another question is, I saw the inference side of DeepSpeech seems to work now on tensorflow 2.x, how much work would the training side be ?

@reuben Had a look at that, he knows better.

What would you like to have shared: only the csv, or also the samples? I think the problem would be somewhere in the samples and not the transcripts (but of course I could be wrong).

I think you would need to share audio + csv

OK, will do.

Merely reduced the problem-space, not of the tensorflow / deepspeech internals. And it would be nice if people could confirm (so it can be semi-worked around by not sorting).

Sure, but given the current workload, I really cannot promise having time to reproduce that: I am still lagging behind on a lot of other super-urgent matters, sadly (thank you covid-19).

OK, I will do some more experiments then and try to pinpoint it some more. I will try to find out if only the batch content matters, or also the state the graph / weights are in from the previous steps. If only the batch content matters, I will test what happens if you only shuffle that.
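The "only shuffle the batch content" experiment could be prepared with something like the sketch below. shuffle_within_batches is a hypothetical helper, and it assumes the rows are consumed in order in consecutive groups of batch_size, which may not match what the training pipeline actually does.

```python
import random


def shuffle_within_batches(rows, batch_size, seed=0):
    """Shuffle rows inside each consecutive batch, keeping the batch order fixed."""
    rng = random.Random(seed)  # seeded so a failing order can be reproduced
    shuffled = []
    for start in range(0, len(rows), batch_size):
        batch = list(rows[start:start + batch_size])
        rng.shuffle(batch)
        shuffled.extend(batch)
    return shuffled


# Placeholder row labels standing in for CSV rows of batches A, B and C.
rows = ["A1", "A2", "B1", "B2", "C1", "C2"]
result = shuffle_within_batches(rows, batch_size=2)
```

Each batch still contains the same files as before, so a step that keeps blowing up regardless of the seed would point at the batch content rather than the order within the batch.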