tf-coriander: Latest github code segfaults on Ubuntu 16.04 / NVIDIA
Using the latest GitHub code on Ubuntu 16.04 with an NVIDIA GPU, a number of tests pass, but it segfaults often enough to be unusable:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1011] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0: N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000.0000)
cl_driver DeviceAllocate 848478208
Segmentation fault (core dumped)
Example backtrace from gdb: https://gist.github.com/hughperkins/10855efd242b0786c7dfc2aa4075e59a
This looks annoyingly hard to diagnose/debug… 😦
Edit: backtrace with debug build: https://gist.github.com/hughperkins/68f636beb90fa9c8cb6d4687acce9f05
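A complementary way to narrow down a crash like this, alongside the gdb backtraces above, is Python's standard-library `faulthandler` module, which prints the Python-level traceback when the process receives a fatal signal such as SIGSEGV. The snippet below is only a sketch: `run_workload` is a hypothetical placeholder for whichever test triggers the segfault, and nothing in this issue indicates that this approach was actually used here.

```python
import faulthandler

# Ask the interpreter to dump the Python traceback of every thread to
# stderr if the process receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
faulthandler.enable(all_threads=True)


def run_workload():
    # Hypothetical placeholder: replace the body with the test or script
    # that intermittently segfaults (for example, one of the failing tests).
    return "ok"


result = run_workload()
```

The same handler can be enabled without touching the code by running the interpreter with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable. It only shows which Python call was active at the time of the crash; the native frames still need gdb.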
About this issue
- State: closed
- Created 7 years ago
- Comments: 22 (17 by maintainers)
Commits related to this issue
- spammy potential fix for https://github.com/hughperkins/tensorflow-cl/issues/34 — committed to hughperkins/tf-coriander by hughperkins 7 years ago
Fixed in https://github.com/hughperkins/tf-coriander/compare/8a02ae2a3a731...af9f284bfa27deae . Yay.