trl: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

I am getting the error traceback below when I run python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16 on a machine with two A10 GPUs (24 GB each). I have torch==2.0.0 installed.

I would appreciate any comments or ideas on how to fix this.

Traceback (most recent call last):
  File "/home/opc/trl/examples/summarization/scripts/reward_summarization.py", line 202, in <module>
    trainer.train(script_args.resume_from_checkpoint)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2663, in training_step
    loss.backward()
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABoolType [1, 1, 377, 377]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/opc/trl/examples/summarization/scripts/wandb/offline-run-20230404_175237-0r3498mc
wandb: Find logs at: ./wandb/offline-run-20230404_175237-0r3498mc/logs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1902146) of binary: /home/opc/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/opc/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/opc/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
reward_summarization.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-04_17:52:47
  host      : instance-20230329-1307.subnet03291319.vcn03291319.oraclevcn.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1902146)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
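
As the error message itself suggests, enabling PyTorch's anomaly detection can pinpoint which forward-pass operation produced the tensor that was later modified in place. A minimal sketch, reusing the trainer and script_args names from reward_summarization.py:

import torch

# Anomaly detection makes the backward pass report the forward operation that
# created the tensor flagged in the RuntimeError. It slows training noticeably,
# so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

trainer.train(script_args.resume_from_checkpoint)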

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

I don’t have a clear understanding of the cause of this issue per se, but the problem stems from the fact that we run two forward passes (one for rewards_j and one for rewards_k) to compute the loss function, and somehow GPT-style models don’t like that. Here’s a minimal workaround that doesn’t involve making changes to transformers.models:

  • Replace the current RewardDataCollatorWithPadding with the following. It merges the j and k examples into a single batch.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union

from transformers import AutoTokenizer
from transformers.utils import PaddingStrategy


@dataclass
class RewardDataCollatorWithPadding:
    tokenizer: AutoTokenizer
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Interleave the preferred ("j") and rejected ("k") examples so the
        # model sees them in a single batch: j, k, j, k, ...
        merged_features = []
        for feature in features:
            merged_features.append(
                {
                    "input_ids": feature["input_ids_j"],
                    "attention_mask": feature["attention_mask_j"],
                }
            )
            merged_features.append(
                {
                    "input_ids": feature["input_ids_k"],
                    "attention_mask": feature["attention_mask_k"],
                }
            )
        # Pad all sequences to a common length in one call.
        batch = self.tokenizer.pad(
            merged_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        return {
            "input_ids": batch["input_ids"],
            "attention_mask": batch["attention_mask"],
            "return_loss": True,
        }
  • Replace the current compute_loss with the following. After a single forward pass, we split the model outputs back into rewards_j and rewards_k and compute the loss.
import torch
import torch.nn as nn
from transformers import Trainer


class RewardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Single forward pass over the interleaved batch built by the collator.
        rewards = model(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )[0]
        # Even rows are the preferred ("j") examples, odd rows the rejected ("k") ones.
        bsz = rewards.size(0)
        jidx = torch.arange(0, bsz, 2)
        kidx = jidx + 1
        rewards_j = rewards[jidx]
        rewards_k = rewards[kidx]
        # Pairwise ranking loss: push the preferred reward above the rejected one.
        loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
        if return_outputs:
            return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
        return loss

This should work for GPT-2 and GPT-NeoX models!
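
For context, here is a rough sketch of how the two pieces above might be wired together; the model, tokenizer, dataset, and training_args names are placeholders rather than the exact objects used in reward_summarization.py:

# Hypothetical wiring; the real script builds its own model, tokenizer,
# datasets, and TrainingArguments.
trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=RewardDataCollatorWithPadding(tokenizer=tokenizer, max_length=512),
)
trainer.train()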

Planning to do a deep dive in the coming weeks into issues related to distributed training; assigning this to myself.