tensorflow: Segmentation fault with small repro

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): b'v1.9.0-rc2-5276-ge57874169f' 1.12.0-dev20181004
  • Python version: 3.6
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: 9.0
  • GPU model and memory: 1080Ti
  • Exact command to reproduce: below

This code:

import tensorflow as tf
import numpy as np

def f(boxes, scores):
    def f(X):
        prob, box = X
        output_shape = tf.shape(prob)
        ids = tf.reshape(tf.where(prob > 0.05), [-1])
        prob = tf.gather(prob, ids)
        box = tf.gather(box, ids)
        # prob = tf.Print(prob, [box, prob], summarize=100, message='boxandprob')
        selection = tf.image.non_max_suppression(box, prob, 100, 0.5)
        selection = tf.to_int32(tf.gather(ids, selection))
        selection = tf.Print(selection, [ids, selection], summarize=100, message='ids_selection_2')
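        # sort the surviving indices ascending (top_k of the negated values), then scatter them into a boolean mask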
        sorted_selection = -tf.nn.top_k(-selection, k=tf.size(selection))[0]
        mask = tf.sparse_to_dense(
            sparse_indices=sorted_selection,
            output_shape=output_shape,
            sparse_values=True,
            default_value=False)
        return mask

    masks = tf.map_fn(f, (scores, boxes), dtype=tf.bool, parallel_iterations=10)     # #cat x N
    return masks

with tf.device('/gpu:0'):
    boxes = tf.placeholder(tf.float32, (80, None, 4), name='boxes')
    scores = tf.placeholder(tf.float32, (80, None), name='scores')
    outs = f(boxes, scores)

config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
data = dict(np.load('debug.npz'))
for k in range(1000):
    sess.run(outs, feed_dict={boxes: data['boxes'].transpose(1, 0, 2)[1:, :, :], scores: data['scores'][:, 1:].T})
    print(k)

causes a segmentation fault on tf-nightly-gpu as well as on tensorflow-gpu==1.11.0. It works on 1.10. It needs the data file debug.npz, attached here: debug.zip
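If the attached debug.npz is unavailable, stand-in data with the shapes the feed_dict above expects can be generated instead. This is only a sketch (N is an arbitrary placeholder), and random scores may not reproduce the crash as reliably as the attached file, since the score distribution controls how many boxes survive the 0.05 threshold:

# Hypothetical stand-in for the attached debug.npz: random arrays with the shapes
# the feed_dict above expects. May not trigger the crash as reliably as the real data.
import numpy as np

N = 1000  # arbitrary number of proposals
np.savez('debug.npz',
         boxes=np.random.rand(N, 81, 4).astype(np.float32),   # -> (80, N, 4) after transpose/slice
         scores=np.random.rand(N, 81).astype(np.float32))     # -> (80, N) after slice/transpose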

Note:

  1. I tested on two machines; the error happens in >90% of runs.
  2. The code was distilled from the bug report about Mask R-CNN evaluation here. The original bug report does not always segfault, but occasionally crashes with other, seemingly unreasonable TF internal errors, such as:
InvalidArgumentError (see above for traceback): scores has incompatible shape
         [[node map/while/non_max_suppression/NonMaxSuppressionV3 (defined at bug.py:15)  = NonMaxSuppressionV3[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/GatherV2_1/_29, map/while/GatherV2/_31, map/while/non_max_suppression/NonMaxSuppressionV3/max_output_size/_33, map/while/non_max_suppression/iou_threshold/_35, map/while/non_max_suppression/score_threshold/_37)]]
2018-10-04 14:59:14.736180: F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum)                                                     
2018-10-04 14:59:49.523436: F tensorflow/core/common_runtime/bfc_allocator.cc:380] Check failed: h != kInvalidChunkHandle 
2018-10-05 00:12:03.720295: F ./tensorflow/core/framework/tensor.h:643] Check failed: new_num_elements == NumElements() (39 vs. 0)

InvalidArgumentError (see above for traceback): indices[1] = [0] is repeated
         [[{{node map/while/SparseToDense}} = SparseToDense[T=DT_BOOL, Tindices=DT_INT32, _class=["loc:@map/while/TensorArrayWrite/TensorArrayWriteV3"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/Neg_1/_51, map/while/Shape/_53, map/while/SparseToDense/sparse_values/_55, map/while/SparseToDense/default_value/_57)]]
         [[{{node map/while/SparseToDense/sparse_values/_54}} = _Send[T=DT_BOOL, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_111_map/while/SparseToDense/sparse_values", _device="/job:localhost/replica:0/task:0/device:GPU:0"](map/while/SparseToDense/sparse_values)]]

After distilling it down to this small repro, it mostly segfaults, but the above error messages might help. It looks like memory corruption.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 21 (13 by maintainers)

Most upvoted comments

I ran into the same issue with TF 1.12. It is not deterministic and fails in about 60% of cases.

This is my error message:

F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum) 

It is fixed in the latest release (1.13rc0).

@tayo I believe your commit https://github.com/tensorflow/tensorflow/commit/8566d9e6fa7dbe3660339befe8b0a3344d24ef2b#diff-6731fe0e9dae6d68dca55b2d50d32c06R320 about the NMS op causes this bug.

The input Tensors of an OpKernel::Compute call should not be stored as members of the OpKernel. This effectively makes OpKernel::Compute non-thread-safe and causes the crash with my sample code above.
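To illustrate the failure mode (a rough Python analogy, not the actual C++ kernel code): if per-call input is stashed on the kernel object itself, concurrent Compute calls, like the ones map_fn issues with parallel_iterations > 1, race on that shared state:

import threading, time

class BadKernel:
    """Analogy for a kernel that stores its per-call input as an object member."""
    def compute(self, x):
        self.x = x              # shared member, clobbered by concurrent calls
        time.sleep(0.001)       # stands in for work done between storing and using the input
        return self.x * 2       # may read another call's input

kernel = BadKernel()
results = {}

def call(i):
    results[i] = kernel.compute(i)

threads = [threading.Thread(target=call, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With a thread-safe compute every entry would satisfy results[i] == 2 * i;
# with the shared member, some entries usually do not -- the same kind of
# clobbering the NMS kernel hits when map_fn runs iterations in parallel.
print({i: r for i, r in results.items() if r != 2 * i})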

Fix is under review.

I can also confirm that pulling the 1.13.0 tf-nightly-gpu build solved this issue for me.

I noticed that this issue does not occur with tf-nightly-gpu 1.13.0.dev20190208, whereas it still occurs with 1.12.

Hi Yuxin,

I was just made aware of this issue recently. I am working on a fix and will push this out soon. Thanks for bringing this to attention!

-Tayo
