tensorflow: Problem with distributed training under XLA compilation.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): custom layer and custom training step
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): tested on Windows 10, Ubuntu 16.04, and Ubuntu 18.04
- TensorFlow installed from (source or binary): both
- TensorFlow version (use command below): TF 2.4 and 2.5 binary releases, and TF 2.4 built from source
- Python version: 3.7
- Bazel version (if compiling from source): 3.5.0
- GCC/Compiler version (if compiling from source): 7.5
- CUDA/cuDNN version: 10.1 and 11.0
- GPU model and memory: 1080 Ti x4
Describe the current behavior
When I train my model on multiple GPUs with XLA compilation enabled, the error below occurs.
Training starts
Traceback (most recent call last):
  File "FFP_/train_w_pruning.py", line 76, in <module>
    train_step(*data)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 787, in __call__
    result = self._call(*args, **kwds)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 854, in _call
    filtered_flat_args, self._concrete_stateful_fn.captured_inputs) # pylint: disable=protected-access
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1920, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 561, in call
    ctx=ctx)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Trying to access resource ResNet/conv/kernel/replica_1_879 located in device /job:localhost/replica:0/task:0/device:GPU:0 [Op:__inference_train_step_dist_88943]
Describe the expected behavior
I want to XLA-compile my multi-GPU training code, but it seems to be unsupported.
Standalone code to reproduce the issue
https://github.com/sseung0703/TF2-multi-gpu-training
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 28 (8 by maintainers)
‘jit_compile’ is the new alias for ‘experimental_compile’
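For reference, a minimal sketch (trivial placeholder function, not from the user's repo) showing the two equivalent spellings; at the time of this issue experimental_compile was still accepted but deprecated in favor of jit_compile:

```python
import tensorflow as tf

@tf.function(jit_compile=True)            # new spelling
def square_new(x):
    return x * x

@tf.function(experimental_compile=True)   # old spelling, deprecated alias
def square_old(x):
    return x * x
```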
The current suggestion is to jit_compile only the parts of training that run independently on each replica (GPU). Any time communication or synchronization is needed, a jit_compile around that part will fail. So a jit_compile around the entire strategy.run will fail, a jit_compile around the function containing optimizer.apply_gradients will fail, and a jit_compile around functions that update metrics will fail. But a jit_compile around the function doing the main per-replica training computation should work.
We are working on lowering the all_reduce (so it can be handled inside XLA) and have resolved some of these issues. Soon users will be able to enable jit_compile on the entire training step (ideally the way the user has done in this case).
To fix the current code, the user needs to change the train_step code in <their github>/main/op_utils.py from:
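The actual op_utils.py code is not reproduced in this thread; the snippet below is only a rough sketch of the failing pattern described above, using placeholder names (strategy, model, optimizer) rather than the user's code:

```python
import tensorflow as tf

# Placeholder setup, only to make the pattern concrete.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    optimizer = tf.keras.optimizers.SGD(0.1)

def replica_step(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    grads = tape.gradient(loss, model.trainable_variables)
    # Under MirroredStrategy, apply_gradients all-reduces the gradients across replicas.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Failing pattern: jit_compile wraps the entire distributed step, so the compiled
# cluster spans strategy.run, the cross-replica all-reduce, and the variable updates,
# which XLA:GPU could not handle at the time of this issue.
@tf.function(jit_compile=True)
def train_step_dist(images, labels):
    per_replica_loss = strategy.run(replica_step, args=(images, labels))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)
```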
The code needs to be changed to:
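Again only a sketch of the recommended pattern, reusing the placeholder names from the snippet above: jit_compile is applied to the per-replica forward/backward computation only, while strategy.run, the gradient all-reduce, and optimizer.apply_gradients stay outside the compiled function:

```python
# Only the per-replica forward/backward pass is XLA-compiled.
@tf.function(jit_compile=True)
def compiled_forward_backward(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    grads = tape.gradient(loss, model.trainable_variables)
    return loss, grads

def replica_step(images, labels):
    loss, grads = compiled_forward_backward(images, labels)
    # Gradient synchronization and the variable update stay outside XLA.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def train_step_dist(images, labels):
    per_replica_loss = strategy.run(replica_step, args=(images, labels))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)
```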
Please let me know if you still hit issues, and we will be happy to resolve them.
For single-host runs (single or multi-GPU using MirroredStrategy) we now support XLA around optimizer.apply_gradients. So, for even better performance, you can do:
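Again a hedged sketch with the same placeholder names rather than the user's actual code: on a single host the whole per-replica step, including optimizer.apply_gradients, is wrapped in a jit_compile'd function, while strategy.run itself remains uncompiled:

```python
# Single-host MirroredStrategy: the per-replica step, including the
# optimizer update, is compiled; only strategy.run stays outside XLA.
@tf.function(jit_compile=True)
def compiled_replica_step(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def train_step_dist(images, labels):
    per_replica_loss = strategy.run(compiled_replica_step, args=(images, labels))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)
```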
+@cheshire
@sseung0703 Your example does not show the whole code, since the part that actually applies experimental_compile=True is not in your example. @nnigania is currently working on supporting collective ops under XLA:GPU compilation, so I think the best bet is to wait for this work to land.
Alternatively, you would have to compile the part of your model which is not using any collectives (as we did for MLPerf).