tensorflow: DataLoss error on TFRecords - happens on one machine but not on another

System information

System 1 (Bug DOESN’T occur)

All details are from inside the Docker container the code runs in

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): True

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.1, 4.15.0-29-generic

  • TensorFlow installed from (source or binary): pip install tensorflow_gpu

  • TensorFlow version (use command below): v1.10.0-0-g656e7a2b34 1.10.0

  • Mobile device: N/A

  • Bazel version (if compiling from source): N/A

  • Python version: 3.6.6

  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

  • CUDA/cuDNN version: CUDA: V9.0.176, cuDNN 7.1.4

  • GPU model and memory: GTX 1080 8GB + GTX 980 Ti 6GB

  • Docker details: NVIDIA Docker: 2.0.2 Client: Version: 17.12.0-ce API version: 1.35 Go version: go1.9.2 Git commit: c97c6d6 Built: Wed Dec 27 20:11:19 2017 OS/Arch: linux/amd64 Experimental: false Server: Engine: Version: 17.12.0-ce API version: 1.35 (minimum version 1.12) Go version: go1.9.2 Git commit: c97c6d6 Built: Wed Dec 27 20:09:53 2017 OS/Arch: linux/amd64 Experimental: false

  • Exact command to reproduce: See below

System 2 (Bug DOES occur)

All details are from inside the Docker container the code runs in

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): True
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.1, 4.15.0-34-generic
  • TensorFlow installed from (source or binary): pip install tensorflow_gpu
  • TensorFlow version (use command below): v1.10.1-0-g4dcfddc5d1 1.10.1
  • Mobile device: N/A
  • Bazel version (if compiling from source): N/A
  • Python version: 3.6.6
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
  • CUDA/cuDNN version: CUDA: V9.0.176, cuDNN 7.2.1
  • GPU model and memory: GTX 1080 Ti 12GB x 2
  • Docker details: NVIDIA Docker: 2.0.3 Client: Version: 18.06.0-ce API version: 1.38 Go version: go1.10.3 Git commit: 0ffa825 Built: Wed Jul 18 19:11:02 2018 OS/Arch: linux/amd64 Experimental: false Server: Engine: Version: 18.06.0-ce API version: 1.38 (minimum version 1.12) Go version: go1.10.3 Git commit: 0ffa825 Built: Wed Jul 18 19:09:05 2018 OS/Arch: linux/amd64 Experimental: false
  • Exact command to reproduce: See below

Describe the problem

Using the same Docker image, which is an altered version of this image, and running the same code over the same TFRecords, I get a DataLoss exception (corrupted record) on one machine (System 2) but not on the other (System 1). I have verified with checksums that the files were transferred intact, and I have retransferred them at least twice. I have also tried other sets of TFRecords I created. The result is the same in every case: the error occurs on System 2 and never on System 1.

The TFRecords were written on System 1 and total over 150 GB. I have also tried rebuilding the image, recreating the containers, and restarting the host machine. The code runs on a single GPU; I tried each GPU in turn, and all fail.
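For reference, below is a minimal sketch of the kind of checksum comparison described above; the directory and file pattern are placeholders, not the actual paths used.

import glob
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file in chunks so the >150 GB of records never has to fit in memory.
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Run on both machines and diff the output (placeholder path and pattern).
for path in sorted(glob.glob('/data/tfrecords/*.tfrecord')):
    print(path, sha256_of(path))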

Source code / logs

Due to the sensitivity of the code I can't share all of it, but I'll paste the relevant lines.

In the training code there are three places where I iterate over the dataset. The first is when I count the number of records: train_size = sum(1 for _ in tf.python_io.tf_record_iterator(meta['train_tfr_path'])). The second is the training loop:

while True:
    try:
        ...
        _, batch_summary = sess.run([opt, merged], feed_dict={handle: train_handle})
        ...
    except tf.errors.OutOfRangeError:
        ...

The third is very similar code for the validation set.
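For context, here is a minimal sketch of how the handle and train_handle used in the loop above could be wired up, assuming the standard TF 1.x feedable-iterator pattern; the validation path key, batch size, and the (omitted) record parsing are placeholders rather than the actual code.

import tensorflow as tf

# Placeholder datasets; the real code parses features out of each serialized record.
train_ds = tf.data.TFRecordDataset(meta['train_tfr_path']).batch(32)
val_ds = tf.data.TFRecordDataset(meta['val_tfr_path']).batch(32)

# One string-handle placeholder selects which iterator feeds the graph.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, train_ds.output_types, train_ds.output_shapes)
next_batch = iterator.get_next()  # the op that raises DataLossError in the traceback below

train_iter = train_ds.make_one_shot_iterator()
val_iter = val_ds.make_initializable_iterator()

with tf.Session() as sess:
    train_handle = sess.run(train_iter.string_handle())
    val_handle = sess.run(val_iter.string_handle())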

After recreating the image and containers and restarting the machine, the train_size calculation (a full pass over the TFRecords) finishes successfully; training then starts and, somewhere in the training loop, fails with the traceback attached below. If I then run the code again, it fails on the very first read from the TFRecords with the same traceback, i.e. this time during the train_size calculation. From then on every run fails at that point until I restart the machine, after which the scenario repeats.

As noted, the same code on the other machine (System 1) never fails.

Traceback:

Traceback (most recent call last):
  File "train.py", line 884, in <module>
    main()
  File "train.py", line 878, in main
    train(graph, model_dir, tensorboard_dir, meta, is_recovering=recovering)
  File "train.py", line 846, in train
    'sequential']})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 6720637495
  [[Node: data/IteratorGetNext = IteratorGetNext[output_shapes=[[?,?], [?,1], [?,?,?], [?,?,?], [?,?], …, [?,?], [?,1], [?,?,?], [?,?,?], [?,?]], output_types=[DT_FLOAT, DT_INT64, DT_INT64, DT_FLOAT, DT_INT64, …, DT_INT64, DT_INT64, DT_FLOAT, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'data/IteratorGetNext', defined at:
  File "train.py", line 884, in <module>
    main()
  File "train.py", line 874, in main
    graph, meta = build_model(available_gpus)
  File "train.py", line 502, in build_model
    return build_classifier(available_gpus)
  File "train.py", line 572, in build_classifier
    n_gpus=FLAGS.n_gpus)
  File "/opt/code/utils/architecture/data/v4.py", line 309, in data_prep
    <deleted line for sensitivity issues>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 410, in get_next
    name=name)), self._output_types,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

DataLossError (see above for traceback): corrupted record at 6720637495
  [[Node: data/IteratorGetNext = IteratorGetNext[output_shapes=[[?,?], [?,1], [?,?,?], [?,?,?], [?,?], …, [?,?], [?,1], [?,?,?], [?,?,?], [?,?]], output_types=[DT_FLOAT, DT_INT64, DT_INT64, DT_FLOAT, DT_INT64, …, DT_INT64, DT_INT64, DT_FLOAT, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
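To isolate the failure outside the training graph, a diagnostic sketch like the following (not part of the original script) can be used: it walks a TFRecord file with the same reader the train_size count uses and catches DataLossError, reporting how many records were read before the corruption was hit.

import tensorflow as tf

def scan_tfrecord(path):
    # Iterate record by record; tf_record_iterator raises DataLossError
    # as soon as a record's length/CRC check fails.
    count = 0
    try:
        for _ in tf.python_io.tf_record_iterator(path):
            count += 1
    except tf.errors.DataLossError as e:
        print('DataLossError after %d records: %s' % (count, e))
        return count, False
    print('OK: %d records' % count)
    return count, True

scan_tfrecord(meta['train_tfr_path'])

If such a scan fails at a different point on each run, or only fails on one machine, that points at the reading host rather than the files themselves, which is consistent with the resolution below.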

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

PROBLEM IS SOLVED

We ran a memtest on our four RAM modules and one of them was faulty. We removed that module, reran the code, and now everything works smoothly.

@owenwork I'll try to investigate that. Since the IT guys ran all the tests, I don't have that information on hand; if I'm able to get some answers, I'll be sure to update.