tensorflow: Segmentation fault with small repro
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): b'v1.9.0-rc2-5276-ge57874169f' 1.12.0-dev20181004
- Python version: 3.6
- Bazel version (if compiling from source): n/a
- GCC/Compiler version (if compiling from source): n/a
- CUDA/cuDNN version: 9.0
- GPU model and memory: 1080Ti
- Exact command to reproduce: below
This code:
```python
import tensorflow as tf
import numpy as np


def f(boxes, scores):
    def f(X):
        prob, box = X
        output_shape = tf.shape(prob)
        ids = tf.reshape(tf.where(prob > 0.05), [-1])
        prob = tf.gather(prob, ids)
        box = tf.gather(box, ids)
        # prob = tf.Print(prob, [box, prob], summarize=100, message='boxandprob')
        selection = tf.image.non_max_suppression(box, prob, 100, 0.5)
        selection = tf.to_int32(tf.gather(ids, selection))
        selection = tf.Print(selection, [ids, selection], summarize=100, message='ids_selection_2')
        sorted_selection = -tf.nn.top_k(-selection, k=tf.size(selection))[0]
        mask = tf.sparse_to_dense(
            sparse_indices=sorted_selection,
            output_shape=output_shape,
            sparse_values=True,
            default_value=False)
        return mask

    masks = tf.map_fn(f, (scores, boxes), dtype=tf.bool, parallel_iterations=10)  # #cat x N
    return masks


with tf.device('/gpu:0'):
    boxes = tf.placeholder(tf.float32, (80, None, 4), name='boxes')
    scores = tf.placeholder(tf.float32, (80, None), name='scores')
    outs = f(boxes, scores)

config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)

data = dict(np.load('debug.npz'))
for k in range(1000):
    sess.run(outs, feed_dict={boxes: data['boxes'].transpose(1, 0, 2)[1:, :, :],
                              scores: data['scores'][:, 1:].T})
    print(k)
```
causes a segmentation fault on tf-nightly-gpu as well as on tensorflow-gpu==1.11.0. It works on 1.10.
It needs the data file debug.npz here:
debug.zip
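If the attachment is unavailable, a synthetic `debug.npz` with the same array shapes can serve as a rough stand-in. This is only a sketch: the shapes are inferred from the `feed_dict` above, the choice of `N` is arbitrary, and random values may not reproduce the crash at the same rate as the original data.

```python
# Hypothetical stand-in for the original debug.npz attachment.
# Shapes inferred from the feed_dict in the repro:
#   data['boxes']  has shape (N, 81, 4); transpose(1, 0, 2)[1:] matches the (80, None, 4) placeholder
#   data['scores'] has shape (N, 81);    [:, 1:].T              matches the (80, None) placeholder
import numpy as np

N = 1000  # arbitrary number of proposals; the real attachment may differ
np.savez('debug.npz',
         boxes=(np.random.rand(N, 81, 4) * 100).astype('float32'),
         scores=np.random.rand(N, 81).astype('float32'))
```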
Note:
- I tested on two machines; the error happens in >90% of runs.
- The code was distilled from the bug report about MaskRCNN evaluation here. The original bug report does not always segfault, but occasionally crashes with other, seemingly unreasonable TF internal errors, such as:
InvalidArgumentError (see above for traceback): scores has incompatible shape
[[node map/while/non_max_suppression/NonMaxSuppressionV3 (defined at bug.py:15) = NonMaxSuppressionV3[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/GatherV2_1/_29, map/while/GatherV2/_31, map/while/non_max_suppression/NonMaxSuppressionV3/max_output_size/_33, map/while/non_max_suppression/iou_threshold/_35, map/while/non_max_suppression/score_threshold/_37)]]
2018-10-04 14:59:14.736180: F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum)
2018-10-04 14:59:49.523436: F tensorflow/core/common_runtime/bfc_allocator.cc:380] Check failed: h != kInvalidChunkHandle
2018-10-05 00:12:03.720295: F ./tensorflow/core/framework/tensor.h:643] Check failed: new_num_elements == NumElements() (39 vs. 0)
InvalidArgumentError (see above for traceback): indices[1] = [0] is repeated
[[{{node map/while/SparseToDense}} = SparseToDense[T=DT_BOOL, Tindices=DT_INT32, _class=["loc:@map/while/TensorArrayWrite/TensorArrayWriteV3"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](map/while/Neg_1/_51, map/while/Shape/_53, map/while/SparseToDense/sparse_values/_55, map/while/SparseToDense/default_value/_57)]]
[[{{node map/while/SparseToDense/sparse_values/_54}} = _Send[T=DT_BOOL, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_111_map/while/SparseToDense/sparse_values", _device="/job:localhost/replica:0/task:0/device:GPU:0"](map/while/SparseToDense/sparse_values)]]
After being distilled into this small repro, it mostly segfaults, but the above error messages might still be useful. It looks like memory corruption.
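For context on the `indices[1] = [0] is repeated` error above: `tf.sparse_to_dense` validates its indices by default and rejects duplicates, which is why the repro sorts the NMS selection before building the mask. A minimal standalone sketch of that validation (not part of the repro):

```python
import tensorflow as tf

with tf.Session() as sess:
    # Sorted, unique indices: fine.
    ok = tf.sparse_to_dense(sparse_indices=[1, 3], output_shape=[5],
                            sparse_values=True, default_value=False)
    print(sess.run(ok))  # [False  True False  True False]

    # Duplicate indices fail validation (validate_indices=True is the default)
    # with an InvalidArgumentError similar to the one in the log above.
    bad = tf.sparse_to_dense(sparse_indices=[0, 0], output_shape=[5],
                             sparse_values=True, default_value=False)
    try:
        sess.run(bad)
    except tf.errors.InvalidArgumentError as e:
        print(e.message)
```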
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 21 (13 by maintainers)
I ran into the same issue with TF 1.12. It is not deterministic and fails in about 60% of cases.
This is my error message:
It is fixed in the latest release (1.13rc0).
@tayo I believe your commit https://github.com/tensorflow/tensorflow/commit/8566d9e6fa7dbe3660339befe8b0a3344d24ef2b#diff-6731fe0e9dae6d68dca55b2d50d32c06R320 about NMS op causes this bug.
The input Tensors of an `OpKernel::Compute` call should not be stored as members of the `OpKernel`. This effectively makes `OpKernel::Compute` not thread-safe, and it crashes with my sample code above. A fix is under review.
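To illustrate the hazard in a language-agnostic way (this is a hypothetical sketch, not TensorFlow's actual C++ kernel code): if a `Compute`-style method stores its per-call input in object state, concurrent calls, like the parallel `map_fn` iterations in the repro, can clobber each other.

```python
# Hypothetical sketch of the thread-safety hazard described above;
# not TensorFlow code, just the same anti-pattern in plain Python.
import threading

class StatefulKernel:
    """Wrongly keeps a per-call input as a member, shared across calls."""
    def compute(self, x):
        self._input = x           # clobbered if another thread calls compute()
        # ... more work would happen here, giving other threads time to run ...
        return self._input * 2    # may read another call's input

kernel = StatefulKernel()
mismatches = []

def worker(x):
    count = 0
    for _ in range(100000):
        if kernel.compute(x) != x * 2:
            count += 1
    mismatches.append(count)

threads = [threading.Thread(target=worker, args=(i + 1,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('mismatches per thread:', mismatches)  # typically non-zero on CPython: the race is real
```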
https://github.com/tensorflow/tensorflow/commit/1845bf763b4c1c54425d9bb8b1554db79759f567 and https://github.com/tensorflow/tensorflow/commit/76e7804409ccd76c7ce08e66eb739544cd5cda68 solved this issue for me on tf-1.12.0
Also can confirm that pulling the 1.13.0 tf-nightly-gpu build solved this issue for me
I noticed that this issue does not occur with tf-nightly-gpu `1.13.0.dev20190208` when compared to `1.12`.

Hi Yuxin,
I was just made aware of this issue recently. I am working on a fix and will push this out soon. Thanks for bringing this to attention!
-Tayo