tensorflow: Segmentation fault with small repro

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): b'v1.9.0-rc2-5276-ge57874169f' 1.12.0-dev20181004
  • Python version: 3.6
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: 9.0
  • GPU model and memory: 1080Ti
  • Exact command to reproduce: below

This code:

import tensorflow as tf
import numpy as np

def f(boxes, scores):
    def f(X):
        prob, box = X
        output_shape = tf.shape(prob)
        ids = tf.reshape(tf.where(prob > 0.05), [-1])
        prob = tf.gather(prob, ids)
        box = tf.gather(box, ids)
        # prob = tf.Print(prob, [box, prob], summarize=100, message='boxandprob')
        selection = tf.image.non_max_suppression(box, prob, 100, 0.5)
        selection = tf.to_int32(tf.gather(ids, selection))
        selection = tf.Print(selection, [ids, selection], summarize=100, message='ids_selection_2')
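        # sort the surviving indices ascending (top_k of the negated values), then scatter them into a boolean mask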
        sorted_selection = -tf.nn.top_k(-selection, k=tf.size(selection))[0]
        mask = tf.sparse_to_dense(
            sparse_indices=sorted_selection,
            output_shape=output_shape,
            sparse_values=True,
            default_value=False)
        return mask

    masks = tf.map_fn(f, (scores, boxes), dtype=tf.bool, parallel_iterations=10)     # #cat x N
    return masks

with tf.device('/gpu:0'):
    boxes = tf.placeholder(tf.float32, (80, None, 4), name='boxes')
    scores = tf.placeholder(tf.float32, (80, None), name='scores')
    outs = f(boxes, scores)

config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
data = dict(np.load('debug.npz'))
for k in range(1000):
    sess.run(outs, feed_dict={boxes: data['boxes'].transpose(1, 0, 2)[1:, :, :], scores: data['scores'][:, 1:].T})
    print(k)

causes a segmentation fault on tf-nightly-gpu as well as on tensorflow-gpu==1.11.0. It works on 1.10. It needs the data file debug.npz, attached here: debug.zip
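If the attached debug.npz is unavailable, stand-in data with the shapes the feed_dict above expects can be generated instead. This is only a sketch (N is an arbitrary placeholder), and random scores may not reproduce the crash as reliably as the attached file, since the score distribution controls how many boxes survive the 0.05 threshold:

# Hypothetical stand-in for the attached debug.npz: random arrays with the shapes
# the feed_dict above expects. May not trigger the crash as reliably as the real data.
import numpy as np

N = 1000  # arbitrary number of proposals
np.savez('debug.npz',
         boxes=np.random.rand(N, 81, 4).astype(np.float32),   # -> (80, N, 4) after transpose/slice
         scores=np.random.rand(N, 81).astype(np.float32))     # -> (80, N) after slice/transpose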

Note:

  1. I tested on two machines; the error happens in >90% of runs.
  2. The code was distilled from the bug report about Mask R-CNN evaluation here. The original bug report does not always segfault, but occasionally crashes with other, seemingly unreasonable TF internal errors, such as:
InvalidArgumentError (see above for traceback): scores has incompatible shape
         [[node map/while/non_max_suppression/NonMaxSuppressionV3 (defined at bug.py:15)  = NonMaxSuppressionV3[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/GatherV2_1/_29, map/while/GatherV2/_31, map/while/non_max_suppression/NonMaxSuppressionV3/max_output_size/_33, map/while/non_max_suppression/iou_threshold/_35, map/while/non_max_suppression/score_threshold/_37)]]
2018-10-04 14:59:14.736180: F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum)                                                     
2018-10-04 14:59:49.523436: F tensorflow/core/common_runtime/bfc_allocator.cc:380] Check failed: h != kInvalidChunkHandle 
2018-10-05 00:12:03.720295: F ./tensorflow/core/framework/tensor.h:643] Check failed: new_num_elements == NumElements() (39 vs. 0)

InvalidArgumentError (see above for traceback): indices[1] = [0] is repeated
         [[{{node map/while/SparseToDense}} = SparseToDense[T=DT_BOOL, Tindices=DT_INT32, _class=["loc:@map/while/TensorArrayWrite/TensorArrayWriteV3"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/Neg_1/_51, map/while/Shape/_53, map/while/SparseToDense/sparse_values/_55, map/while/SparseToDense/default_value/_57)]]
         [[{{node map/while/SparseToDense/sparse_values/_54}} = _Send[T=DT_BOOL, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_111_map/while/SparseToDense/sparse_values", _device="/job:localhost/replica:0/task:0/device:GPU:0"](map/while/SparseToDense/sparse_values)]]

After distilling it down to this small repro, it mostly segfaults, but the above error messages might help. It looks like memory corruption.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 21 (13 by maintainers)

Most upvoted comments

I ran into the same issue with TF 1.12. It is not deterministic and fails in about 60% of cases.

This is my error message:

F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum) 

It is fixed in the latest release (1.13rc0).

@tayo I believe your commit https://github.com/tensorflow/tensorflow/commit/8566d9e6fa7dbe3660339befe8b0a3344d24ef2b#diff-6731fe0e9dae6d68dca55b2d50d32c06R320 about the NMS op causes this bug.

The input Tensors of an OpKernel::Compute call should not be stored as members of the OpKernel. This effectively makes OpKernel::Compute non-thread-safe and causes the crash with my sample code above.
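To illustrate the failure mode (a rough Python analogy, not the actual C++ kernel code): if per-call input is stashed on the kernel object itself, concurrent Compute calls, like the ones map_fn issues with parallel_iterations > 1, race on that shared state:

import threading, time

class BadKernel:
    """Analogy for a kernel that stores its per-call input as an object member."""
    def compute(self, x):
        self.x = x              # shared member, clobbered by concurrent calls
        time.sleep(0.001)       # stands in for work done between storing and using the input
        return self.x * 2       # may read another call's input

kernel = BadKernel()
results = {}

def call(i):
    results[i] = kernel.compute(i)

threads = [threading.Thread(target=call, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With a thread-safe compute every entry would satisfy results[i] == 2 * i;
# with the shared member, some entries usually do not -- the same kind of
# clobbering the NMS kernel hits when map_fn runs iterations in parallel.
print({i: r for i, r in results.items() if r != 2 * i})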

Fix is under review.

I can also confirm that pulling the 1.13.0 tf-nightly-gpu build solved this issue for me.

I noticed that this issue does not occur with tf-nightly-gpu 1.13.0.dev20190208, whereas it still occurs with 1.12.

Hi Yuxin,

I was just made aware of this issue recently. I am working on a fix and will push this out soon. Thanks for bringing this to attention!

-Tayo
