tensorflow: GPU race conditions from `tf.map_fn`
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes; all of the code that caused this issue uses TensorFlow/Keras operations.
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 1.14.0
- Python version: 3.6.8
- CUDA/cuDNN version: 10.0/7.6.2
- GPU model and memory: RTX 6000x2, 48 GB
Describe the current behavior
I’ve created a custom Keras layer called ROI that uses tf.map_fn, precisely because the layer has an unknown parameter that it needs to take as a tensor.
This layer works perfectly on CPU for both inference and training, and it also works perfectly on GPU during inference. But during training on the GPU, an exception about colocation of the ROI layer's ops occurs:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation ROI/map/while/Identity_1: Could not satisfy explicit device specification '' because the node node ROI/map/while/Identity_1 (defined at /path/to/custom/layer/custom.py:70) placed on device No device assignments were active during op 'ROI/map/while/Identity_1' creation.
[[node ROI/map/while/Identity_1 (defined at /path/to/custom/layer/custom.py:70) Additional information about colocations: No node-device colocations were active during op 'ROI/map/while/Identity_1' creation.
No device assignments were active during op 'ROI/map/while/Identity_1' creation.
Manually pinning the ROI layer to a CPU device with tf.device works, but I want ROI to support the GPU as well.
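The tf.device workaround mentioned above can be sketched roughly as follows. This is a hypothetical illustration, not the reporter's actual code: `ROILayer` is a stand-in for their custom layer, and the `reduce_max` body is invented for the example.

```python
import tensorflow as tf

# Hypothetical stand-in for the reporter's custom ROI layer.
class ROILayer(tf.keras.layers.Layer):
    def call(self, inputs):
        # tf.map_fn applies the function per batch element via a while
        # loop backed by TensorArray read/write ops.
        return tf.map_fn(lambda x: tf.reduce_max(x, axis=0), inputs)

inputs = tf.keras.Input(shape=(4, 4))
# In TF 1.x graph mode, ops created inside this scope are pinned to the
# CPU, which is the workaround described in the report.
with tf.device('/cpu:0'):
    roi_out = ROILayer()(inputs)
model = tf.keras.Model(inputs, roi_out)
```

Note that the device scope matters most under TF 1.x graph construction, where op placement is decided when the graph is built; under eager TF 2.x the scope applies at execution time instead.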
My hypothesis
The ROI layer works on CPU because only a single core at a time handles the layer; even if multiprocessing is activated, only a few cores slowly balance the task.
But whenever the GPU is utilized, thousands of cores work together in parallel without waiting for each other to finish. Thus one of the processes tries to gather data from a TensorArray that is still inside the while loop (created by tf.map_fn), which causes the error.
Describe the expected behavior
TensorFlow should handle these race conditions by waiting for its own tf.map_fn to finish instead of raising an exception.
Code to reproduce the issue
This is the code that immediately causes the mentioned issue on my local machine.
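The reproduction script itself is not included in the archived issue. As a hedged approximation of the pattern it describes (a tf.map_fn-based custom layer trained through Keras), with hypothetical layer and model definitions:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the reporter's ROI layer: call() relies on
# tf.map_fn, which in graph mode builds a while loop with TensorArray
# read/write ops -- the ops named in the colocation error above.
class ROILayer(tf.keras.layers.Layer):
    def call(self, inputs):
        return tf.map_fn(lambda x: tf.reduce_max(x, axis=0), inputs)

model = tf.keras.Sequential([ROILayer(), tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss='mse')

x = np.random.rand(16, 8, 8).astype('float32')
y = np.random.rand(16, 1).astype('float32')
# Training builds the gradient ops for the map_fn loop; on TF 1.14 with
# a GPU, this is the step where the reported placement error surfaced.
model.fit(x, y, epochs=1, verbose=0)
```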
Other info / logs
Full Log:
Traceback (most recent call last):
File "/path/to/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/path/to/site-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
self._extend_graph()
File "/path/to/site-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation ROI/map/while/Identity_1: Could not satisfy explicit device specification '' because the node {{colocation_node ROI/map/while/Identity_1}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=1 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
StridedSliceGrad: GPU CPU XLA_CPU XLA_GPU
NextIteration: GPU CPU XLA_CPU XLA_GPU
Mul: GPU CPU XLA_CPU XLA_GPU
Equal: GPU CPU XLA_CPU XLA_GPU
DynamicStitch: GPU CPU XLA_CPU XLA_GPU
Fill: GPU CPU XLA_CPU XLA_GPU
FloorMod: GPU CPU XLA_CPU XLA_GPU
Shape: GPU CPU XLA_CPU XLA_GPU
Reshape: GPU CPU XLA_CPU XLA_GPU
TensorArrayReadV3: GPU CPU XLA_CPU XLA_GPU
TensorArrayScatterV3: GPU CPU XLA_CPU XLA_GPU
TensorArraySizeV3: GPU CPU XLA_CPU XLA_GPU
Const: GPU CPU XLA_CPU XLA_GPU
TensorArrayWriteV3: GPU CPU XLA_CPU XLA_GPU
Identity: GPU CPU XLA_CPU XLA_GPU
GreaterEqual: GPU CPU XLA_CPU XLA_GPU
Exit: GPU CPU XLA_CPU XLA_GPU
Cast: GPU CPU XLA_CPU XLA_GPU
ControlTrigger: GPU CPU XLA_CPU XLA_GPU
TensorArrayGradV3: GPU CPU XLA_CPU XLA_GPU
Pack: GPU CPU XLA_CPU XLA_GPU
Enter: GPU CPU XLA_CPU XLA_GPU
TensorArrayV3: GPU CPU XLA_CPU XLA_GPU
Merge: GPU CPU XLA_CPU XLA_GPU
StackV2: GPU CPU XLA_CPU XLA_GPU
Range: GPU CPU XLA_CPU XLA_GPU
TensorArrayGatherV3: GPU CPU XLA_CPU XLA_GPU
StackPushV2: GPU CPU XLA_CPU XLA_GPU
Switch: GPU CPU XLA_CPU XLA_GPU
RealDiv: GPU CPU XLA_CPU XLA_GPU
Add: GPU CPU XLA_CPU XLA_GPU
StridedSlice: GPU CPU XLA_CPU XLA_GPU
Max: GPU CPU XLA_CPU XLA_GPU
LoopCond: GPU CPU XLA_CPU XLA_GPU
Sum: GPU CPU XLA_CPU XLA_GPU
StackPopV2: GPU CPU XLA_CPU XLA_GPU
Sub: GPU CPU XLA_CPU XLA_GPU
Colocation members, user-requested devices, and framework assigned devices, if any:
ROI/map/TensorArray_2 (TensorArrayV3) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ROI/map/while/Identity_1 (Identity)
ROI/map/while/map/TensorArray_1 (TensorArrayV3) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ROI/map/while/map/while/Identity_1 (Identity)
ROI/map/while/map/while/strided_slice_4/stack (Pack)
ROI/map/while/map/while/strided_slice_4/stack_1 (Pack)
ROI/map/while/map/while/strided_slice_4 (StridedSlice)
ROI/map/while/map/while/Max (Max)
ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3/Enter (Enter)
ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3 (TensorArrayWriteV3) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ROI/map/while/map/while/Exit_2 (Exit)
ROI/map/while/map/TensorArrayStack/TensorArraySizeV3 (TensorArraySizeV3) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ROI/map/while/map/TensorArrayStack/range/start (Const)
ROI/map/while/map/TensorArrayStack/range/delta (Const)
ROI/map/while/map/TensorArrayStack/range (Range)
ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3 (TensorArrayGatherV3) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ROI/map/while/TensorArrayWrite/TensorArrayWriteV3/Enter (Enter)
ROI/map/while/TensorArrayWrite/TensorArrayWriteV3 (TensorArrayWriteV3) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ROI/map/TensorArrayStack/TensorArraySizeV3 (TensorArraySizeV3) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ROI/map/TensorArrayStack/range/start (Const)
ROI/map/TensorArrayStack/range/delta (Const)
ROI/map/TensorArrayStack/range (Range)
ROI/map/TensorArrayStack/TensorArrayGatherV3 (TensorArrayGatherV3) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
training/MultiplierWrapper/gradients/f_count_3 (Const)
training/MultiplierWrapper/gradients/f_count_4 (Enter)
training/MultiplierWrapper/gradients/Merge_2 (Merge)
training/MultiplierWrapper/gradients/Switch_2 (Switch)
training/MultiplierWrapper/gradients/Add_1/y (Const)
training/MultiplierWrapper/gradients/Add_1 (Add)
training/MultiplierWrapper/gradients/f_count_5 (Exit)
training/MultiplierWrapper/gradients/Const (Const)
training/MultiplierWrapper/gradients/f_acc (StackV2)
training/MultiplierWrapper/gradients/Enter (Enter)
training/MultiplierWrapper/gradients/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/b_count_4 (Const)
training/MultiplierWrapper/gradients/b_count_5 (Enter)
training/MultiplierWrapper/gradients/Merge_3 (Merge)
training/MultiplierWrapper/gradients/GreaterEqual_1/Enter (Enter)
training/MultiplierWrapper/gradients/GreaterEqual_1 (GreaterEqual)
training/MultiplierWrapper/gradients/b_count_6 (LoopCond)
training/MultiplierWrapper/gradients/Switch_3 (Switch)
training/MultiplierWrapper/gradients/Sub_1 (Sub)
training/MultiplierWrapper/gradients/b_count_7 (Exit)
training/MultiplierWrapper/gradients/ROI/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3 (TensorArrayGradV3)
training/MultiplierWrapper/gradients/ROI/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/gradient_flow (Identity)
training/MultiplierWrapper/gradients/ROI/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayScatter/TensorArrayScatterV3 (TensorArrayScatterV3)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3 (TensorArrayGradV3)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/gradient_flow (Identity)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/f_acc (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3 (TensorArrayReadV3)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/f_acc (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/Const_1 (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/f_acc_1 (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/StackPushV2_1 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/StackPopV2_1/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3/StackPopV2_1 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3 (TensorArrayGradV3)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/gradient_flow (Identity)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayScatter/TensorArrayScatterV3/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayScatter/TensorArrayScatterV3/f_acc (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayScatter/TensorArrayScatterV3/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayScatter/TensorArrayScatterV3/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayScatter/TensorArrayScatterV3/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayScatter/TensorArrayScatterV3/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayScatter/TensorArrayScatterV3 (TensorArrayScatterV3)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Exit_2_grad/b_exit (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3/f_acc (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/b_sync (ControlTrigger)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/TensorArrayGradV3 (TensorArrayGradV3)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayGrad/gradient_flow (Identity)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/f_acc (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/StackPopV2/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3 (TensorArrayReadV3)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Shape (Shape)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Size (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/add/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/add (Add)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/mod (FloorMod)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Shape_1 (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/range/start (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/range/delta (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/range (Range)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Fill/value (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Fill (Fill)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/DynamicStitch/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/DynamicStitch/f_acc (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/DynamicStitch/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/DynamicStitch/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/DynamicStitch/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/DynamicStitch/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/DynamicStitch/StackPopV2/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/DynamicStitch/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/DynamicStitch (DynamicStitch)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape/f_acc (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape/StackPopV2/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape (Reshape)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape_1 (Reshape)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Equal/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Equal/f_acc (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Equal/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Equal/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Equal/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Equal/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Equal/StackPopV2/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Equal/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Equal (Equal)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Cast (Cast)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Sum (Sum)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/Reshape_2 (Reshape)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/truediv (RealDiv)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/Max_grad/mul (Mul)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/Shape (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/Const (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/f_acc (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/StackPushV2 (StackPushV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/StackPopV2/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/StackPopV2/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/StackPopV2 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/Const_1 (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/f_acc_1 (StackV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/Enter_2 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/Enter_3 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/StackPushV2_1 (StackPushV2)
training/MultiplierWrapper/gradients/NextIteration_2 (NextIteration)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/StackPopV2_1/Enter (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/StackPopV2_1/Enter_1 (Enter)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/StackPopV2_1 (StackPopV2)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/TensorArrayWrite/TensorArrayWriteV3_grad/TensorArrayReadV3/b_sync (ControlTrigger)
training/MultiplierWrapper/gradients/NextIteration_3 (NextIteration)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad/Const_2 (Const)
training/MultiplierWrapper/gradients/ROI/map/while/map/while/strided_slice_4_grad/StridedSliceGrad (StridedSliceGrad)
[[{{node ROI/map/while/Identity_1}}]]
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 19 (7 by maintainers)
Pretty sure it would work
Okay, I think I see the big picture. We have a graph containing an operation that can be placed on the GPU, but whose gradient can only be placed on the CPU.
When the model is constructed normally via model.compile/model.fit, it works because the graph is built completely in one go.
When we load weights, the graph is partially constructed inside K.batch_set_value, using only a subset of the nodes. Since gradient ops are still missing, the placer has no problem placing everything on the GPU.
When we then try to train, we add the remaining ops, but we retain the device assignments made during the partial construction, which are incompatible with the full graph.
Or something like that. 😃
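The sequence described above (build and train in one go versus build, load weights, then train) can be sketched as follows. The layer and file names are hypothetical; on TF 2.x both paths run without the placement error, so this only illustrates the shape of the trigger, not the failure itself.

```python
import os
import tempfile
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the reporter's map_fn-based layer.
class ROILayer(tf.keras.layers.Layer):
    def call(self, inputs):
        return tf.map_fn(lambda x: tf.reduce_max(x, axis=0), inputs)

def make_model():
    m = tf.keras.Sequential([ROILayer(), tf.keras.layers.Dense(1)])
    m.compile(optimizer='sgd', loss='mse')
    return m

x = np.random.rand(8, 4, 4).astype('float32')
y = np.random.rand(8, 1).astype('float32')
path = os.path.join(tempfile.mkdtemp(), 'roi.weights.h5')

m1 = make_model()
m1.fit(x, y, epochs=1, verbose=0)   # full graph built in one go: works
m1.save_weights(path)

m2 = make_model()
m2.build(input_shape=(None, 4, 4))
m2.load_weights(path)               # partial graph construction
m2.fit(x, y, epochs=1, verbose=0)   # on TF 1.14 + GPU, this step failed
```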
P.S. And I found where exactly device designations are being retained. Right here: https://github.com/tensorflow/tensorflow/blob/ee16fcac960ae660e0e4496658a366e2f745e1f0/tensorflow/core/common_runtime/graph_execution_state.cc#L243
After commenting out that line as well as these https://github.com/tensorflow/tensorflow/blob/ee16fcac960ae660e0e4496658a366e2f745e1f0/tensorflow/core/common_runtime/direct_session.cc#L1502-L1505 , and then rebuilding, the error goes away! ( Of course, those lines are probably there for a reason and simply axing them in the live repository is probably not the optimal solution … )
More information:
During placement, in core/common_runtime/colocation_graph.cc, the placer creates node 93:
It then goes through a series of optimizations and merges, which I haven’t quite traced, as a result of which the corresponding Member object ends up with
requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0'. At this point, all backprop ops are missing; they aren’t added to the graph until much later.
Eventually, it assigns node 588, which is, I think, a gradient of op 93:
But that one can’t be placed on the GPU, because the corresponding GPU kernel seems to be missing. But it has
_class=["loc:@roi_layer/map/while/map/TensorArray_1"], which seems to say that it either can be, or must be, colocated with 93. A while later, the placer does colocate 93 and 588, and now we end up with a Member object that was requested to be on the GPU but can only be placed on the CPU. (The colocation logic does not check for this possibility.)
Then, quite a while later, the problem is noticed and reported in a singularly unhelpful error message.
At this point I’m not sure if the bug is in the placer or in the savefile loader. It looks like loading the savefile somehow messes up the model. The error seems to go away if I replace
K.batch_set_value(weight_value_tuples) with a direct assignment in keras/saving/hdf5_format.py. (Though I’m not sure if that cures the problem or just messes up the loading.)
I can reproduce it with a 1080 Ti.
The problem is not a race condition. Part of the constructed graph is not GPU-friendly and has to be placed on the CPU. The placer manages to work out a good arrangement if no weights have been loaded into the model. Otherwise, something somewhere takes a wrong turn and the placer arrives at a pathological state.
I tried to compare graph dumps with and without the weight load, but they are exceedingly complicated and the reason for the mismatch is not apparent. It does look like the run with the weight load is missing all of the backprop-related nodes until a late point. For example, without the weight load, placer_input_2.pbtxt is 325 KB and has lots of “training/…” nodes. With the weight load, placer_input_2.pbtxt is 116 KB and has no “training” nodes at all. (They only appear in placer_input_4.pbtxt.) In placer_output_2.pbtxt, the node roi_layer/map/while/TensorArrayWrite/TensorArrayWriteV3 is assigned to the CPU without the weight load and to the GPU with it.