tf-coriander: Latest GitHub code segfaults on Ubuntu 16.04 / NVIDIA

Using the latest GitHub code on Ubuntu 16.04 with an NVIDIA GPU, a bunch of tests pass, but it segfaults intermittently, often enough to be unusable:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1011] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0:   N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000.0000)
cl_driver DeviceAllocate 848478208
Segmentation fault (core dumped)

Example backtrace from gdb: https://gist.github.com/hughperkins/10855efd242b0786c7dfc2aa4075e59a

This looks annoyingly hard to diagnose/debug… 😦

Edit: backtrace with debug build: https://gist.github.com/hughperkins/68f636beb90fa9c8cb6d4687acce9f05
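
Since the crash is intermittent, it may help to put a number on how often a given test dies. The following is only a rough sketch, not something from the tf-coriander repo: the helper name `count_segfaults` and the script name in the usage comment are made up for illustration, and it assumes a POSIX system, where a child killed by signal N reports return code -N.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs=20):
    """Run `cmd` `runs` times and return how many runs were killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        # Output is discarded; only the exit status matters here.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, a negative return code -N means the child died from signal N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes


if __name__ == "__main__":
    # Hypothetical usage: python count_segfaults.py python some_test.py
    cmd = sys.argv[1:]
    if cmd:
        print(count_segfaults(cmd), "segfaults")
```

Running this against a single test before and after a change gives a crude crash rate to compare, which is easier to reason about than "every so often".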

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 22 (17 by maintainers)

Most upvoted comments