DeepSpeech: Restoring from checkpoint failed.

virtualenv --python=python3.6 env

source env/bin/activate

git clone https://github.com/mozilla/DeepSpeech
git checkout v0.6.0

downloaded v0.6.0 pretrained checkpoint
https://github.com/mozilla/DeepSpeech/releases/download/v0.6.0/deepspeech-0.6.0-checkpoint.tar.gz

cd DeepSpeech
pip install -r requirements.txt

pip install tensorflow-gpu == 1.14.0

pip3 install $(python3 util/taskcluster.py --decoder)

Continuing training from a release model:
mkdir fine_tuning_checkpoints
python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir ./deepspeech-0.6.0-checkpoint --epochs 3 --train_files ./data/csv_files/train.csv --dev_files ./data/csv_files/dev.csv --test_files ./data/csv_files/test.csv --learning_rate 0.0001

Instructions for updating:
Use standard file APIs to check for files with this prefix.
W1206 06:45:41.998423 140389067556672 deprecation.py:323] From /media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./deepspeech-0.6.0-checkpoint/best_dev-233784
I1206 06:45:42.020016 140389067556672 saver.py:1280] Restoring parameters from ./deepspeech-0.6.0-checkpoint/best_dev-233784
Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
	 [[{{node save_1/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
	 [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

Original stack trace for 'save_1/RestoreV2':
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 495, in train
    best_dev_saver = tfv1.train.Saver(max_to_keep=1)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1296, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1614, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 678, in get_tensor
    return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 554, in train
    loaded = try_loading(session, best_dev_saver, 'best_dev_checkpoint', 'best validation')
  File "DeepSpeech.py", line 403, in try_loading
    saver.restore(session, checkpoint_path)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1302, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
	 [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

Original stack trace for 'save_1/RestoreV2':
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 495, in train
    best_dev_saver = tfv1.train.Saver(max_to_keep=1)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/media/user1/storage-1/Murugan/DeepSpeech/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

OS Platform and Distribution: Linux Ubuntu 16.04
TensorFlow version: 1.14.0
Python version: 3.6.5
CUDA/cuDNN version: 10.0
GPU model and memory: 24 GB x 4 GPUs

How to resolve this issue? i was followed right instructions but why it is happened?

About this issue

Original URL
State: closed
Created 5 years ago
Comments: 28 (1 by maintainers)

Most upvoted comments

You have to specify --use_cudnn_rnn, it’s not enabled by default.

reuben on Dec 6, 2019

@lissyx sir. my CuDNN setup might be wrong?

@lissyx sir. how to resolve this issue? what is the problem here i did? 😃

tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
	 [[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

Ok, I can’t keep repeating over and over the same things. I told you: the error is because it cannot resume using CuDNN. Check your setup if it is supposed to work.

lissyx on Dec 6, 2019

here i am case 2. my system is capable of using CuDNN RNN. then normally with --checkpoint_dir is enough for me.

but why i need -cudnn_checkpoint?

This is what I asked you in the beginning, if your setup was properly done for CuDNN. The error obviously suggests it’s not the case.

lissyx on Dec 6, 2019

Weird. I remember this error when loading a cudnn checkpoint on a non cudnn setup, can you check that? I’ the release notes we also document the flag to use in that case, can you test with it?

lissyx on Dec 6, 2019

@MuruganR96 Can you share pip list output ?

lissyx on Dec 6, 2019