tensorflow: Problem with distributed training and XLA compilation.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    • custom layer and custom training step
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    • I have tested on Windows 10, Ubuntu 16.04, and Ubuntu 18.04.
  • TensorFlow installed from (source or binary):
    • both
  • TensorFlow version (use command below):
    • I have tried the distributed (pip) builds of TF 2.4 and 2.5, as well as TF 2.4 installed from source.
  • Python version:
    • 3.7
  • Bazel version (if compiling from source):
    • 3.5.0
  • GCC/Compiler version (if compiling from source):
    • 7.5
  • CUDA/cuDNN version:
    • 10.1 and 11.0
  • GPU model and memory:
    • 1080ti x4

Describe the current behavior When I train my model on multiple GPUs with XLA compilation, the error below occurs.

Training starts
Traceback (most recent call last):
  File "FFP_/train_w_pruning.py", line 76, in <module>
    train_step(*data)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 787, in __call__
    result = self._call(*args, **kwds)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 854, in _call
    filtered_flat_args, self._concrete_stateful_fn.captured_inputs)  # pylint: disable=protected-access
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1920, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 561, in call
    ctx=ctx)
  File "/home/cvip/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Trying to access resource ResNet/conv/kernel/replica_1_879 located in device /job:localhost/replica:0/task:0/device:GPU:0 [Op:__inference_train_step_dist_88943]

Describe the expected behavior I want to XLA-compile my multi-GPU training code, but it does not seem to be supported.

Standalone code to reproduce the issue https://github.com/sseung0703/TF2-multi-gpu-training
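
The linked repository is not quoted inline, so here is a minimal hedged sketch of the kind of setup being described: a MirroredStrategy training step wrapped in an XLA-compiled tf.function. The model, loss, and data below are placeholders, not the repository's actual code.

    import tensorflow as tf

    GLOBAL_BATCH_SIZE = 256
    strategy = tf.distribute.MirroredStrategy()  # the report uses 4x 1080 Ti

    with strategy.scope():
        # Placeholder model standing in for the user's custom ResNet.
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(32, 32, 3)),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(10),
        ])
        optimizer = tf.keras.optimizers.SGD(0.1)
        loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.SUM)

    def train_step(images, labels):
        with tf.GradientTape() as tape:
            pred = model(images, training=True)
            total_loss = loss_object(labels, pred) / GLOBAL_BATCH_SIZE
        gradients = tape.gradient(total_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Wrapping the distributed step in an XLA-compiled tf.function is the
    # pattern that the report says fails with InvalidArgumentError on multi-GPU.
    @tf.function(experimental_compile=True)
    def train_step_dist(images, labels):
        strategy.run(train_step, args=(images, labels))

    images = tf.random.normal([GLOBAL_BATCH_SIZE, 32, 32, 3])
    labels = tf.random.uniform([GLOBAL_BATCH_SIZE], maxval=10, dtype=tf.int32)
    train_step_dist(images, labels)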

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 28 (8 by maintainers)

Most upvoted comments

‘jit_compile’ is the new alias for ‘experimental_compile’

The current suggestion is to ‘jit_compile’ only the parts of training that run independently on each replica (GPU). Any time communication/synchronization is needed, ‘jit_compile’ around that part will fail. So a ‘jit_compile’ around the entire strategy.run will fail, a ‘jit_compile’ around the function containing ‘optimizer.apply_gradients’ will fail, and any ‘jit_compile’ around functions that update metrics will fail. But ‘jit_compile’ around the function doing the main training computation should work.

We are working on lowering the all_reduce and have resolved some of these issues. Soon users will be able to enable ‘jit_compile’ on the entire training step (ideally the way the user has done in this case).

To fix the current code, the user needs to change the train_step code in <their github>/main/op_utils.py from:

    @tf.function(experimental_compile=args.compile)
    def train_step(images, labels):
        with tf.GradientTape() as tape:
            pred = model(images, training=True)
            total_loss = loss_object(labels, pred) / args.batch_size
        gradients = tape.gradient(total_loss, model.trainable_variables)
        if args.weight_decay > 0.:
            gradients = [g + v * args.weight_decay for g, v in zip(gradients, model.trainable_variables)]
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        train_loss.update_state(total_loss)
        train_accuracy.update_state(labels, pred)

    @tf.function(experimental_compile=args.compile)
    def train_step_dist(image, labels):
        strategy.run(train_step, args=(image, labels))

The code needs to be changed to:

    @tf.function(jit_compile=True)
    def compiled_step(images, labels):
        # XLA-compile only the per-replica forward and backward pass,
        # which involves no cross-replica communication.
        with tf.GradientTape() as tape:
            pred = model(images, training=True)
            total_loss = loss_object(labels, pred) / args.batch_size
        gradients = tape.gradient(total_loss, model.trainable_variables)
        return total_loss, pred, gradients

    def train_step(images, labels):
        total_loss, pred, gradients = compiled_step(images, labels)
        if args.weight_decay > 0.:
            gradients = [g + v * args.weight_decay for g, v in zip(gradients, model.trainable_variables)]

        # Gradient application (all-reduce) and metric updates stay outside XLA.
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        train_loss.update_state(total_loss)
        train_accuracy.update_state(labels, pred)

    @tf.function
    def train_step_dist(image, labels):
        strategy.run(train_step, args=(image, labels))
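
For completeness, a hedged sketch of how the rewritten train_step_dist would typically be driven under MirroredStrategy; the dataset variables and the epoch count below are assumptions, not the linked repository's code:

    # Hypothetical driver loop; `images`/`labels` stand in for the real dataset.
    train_ds = tf.data.Dataset.from_tensor_slices((images, labels)) \
        .shuffle(10000).batch(args.batch_size)
    dist_ds = strategy.experimental_distribute_dataset(train_ds)

    for epoch in range(args.epochs):
        for batch in dist_ds:
            train_step_dist(*batch)  # per-replica compute runs through the XLA-compiled compiled_step
        print(epoch, float(train_loss.result()), float(train_accuracy.result()))
        train_loss.reset_states()
        train_accuracy.reset_states()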

Please let me know if you still hit issues, and we will be happy to resolve them.

For single-host runs (single or multi-GPU using MirroredStrategy) we now support XLA around optimizer.apply_gradients. So for even better performance we can do:

    @tf.function(jit_compile=True)
    def train_step(*data):
        # No separate XLA scope is needed around compiled_step; the whole step is compiled.
        total_loss, pred, gradients = compiled_step(*data)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

+@cheshire

@sseung0703 Your example does not show the whole code; the part that actually applies experimental_compile=True is missing from it.

@nnigania is currently working on supporting collective ops under XLA:GPU compilation, so I think the best bet is to wait for this work to land.

Alternatively, you would have to compile the part of your model which is not using any collectives (as we did for MLPerf).
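
For reference, a minimal hedged sketch of that alternative for this user's setup: only the collective-free forward pass (and its backward pass) is compiled, while gradient aggregation and the optimizer update stay outside XLA. Apart from model, loss_object, optimizer, and args, the names here are illustrative.

    # Hedged sketch: compile only the part that needs no cross-replica collectives.
    @tf.function(jit_compile=True)
    def forward(images, labels):
        pred = model(images, training=True)
        return loss_object(labels, pred) / args.batch_size

    def train_step(images, labels):
        with tf.GradientTape() as tape:
            total_loss = forward(images, labels)
        gradients = tape.gradient(total_loss, model.trainable_variables)
        # The all-reduce happens inside apply_gradients under MirroredStrategy, so it is left uncompiled.
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))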