transformers: AssertionError with multiple GPUs

System Info

Red Hat Server 7.7
PyTorch: 1.6.0
Transformers: 3.0.2
Python: 3.7.6
Number of GPUs: 4

Question

I am trying to finetune a GPT2 model using Trainer with multiple GPUs installed on my machine. However, I get the following error:

Traceback (most recent call last):
  File "run_finetune_gpt2.py", line 158, in <module>
    main()
  File "run_finetune_gpt2.py", line 145, in main
    trainer.train()
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    assert all(map(lambda i: i.is_cuda, inputs))
AssertionError
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: You can sync this run to the cloud by running:
wandb: wandb sync wandb/dryrun-20200914_134757-1sih3p0q

Any ideas about what might be going on? Thanks in advance!
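For context, the assertion that fires lives in torch.nn.parallel._functions.Gather: after each replica's forward pass, nn.DataParallel gathers the per-device outputs onto the output device and asserts that every output tensor is a CUDA tensor. Below is a minimal sketch (the model and names are illustrative, not taken from the original script) that reproduces the same AssertionError by returning a tensor that has left the GPU, e.g. via a numpy round-trip inside forward:

import torch
import torch.nn as nn

class BrokenModel(nn.Module):
    # Illustrative model whose forward round-trips through numpy,
    # so the returned tensor is a CPU tensor (and detached from autograd).
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        out = self.linear(x)
        out = torch.from_numpy(out.detach().cpu().numpy())
        return out

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    model = nn.DataParallel(BrokenModel().cuda())
    x = torch.randn(8, 10).cuda()
    model(x)  # gather sees CPU outputs -> AssertionError, as in the traceback above

Note that with only one visible GPU, DataParallel calls the module directly and skips the gather step, so the same code can appear to run fine on a single GPU.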

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (1 by maintainers)

Most upvoted comments

There was no error because the tensors were put back on the only GPU you had when converting back from numpy, but the gradients were still wrong (basically, everything that happened before the numpy conversion was wiped out). A sketch of this is below.
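To make the comment above concrete: a numpy round-trip inside the forward/loss computation detaches the result from the autograd graph, so nothing necessarily errors out, but the parameters upstream of the conversion never receive gradients. A hedged sketch of the failure mode and the torch-only alternative (all names here are illustrative):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x = torch.randn(4, 10)
target = torch.randn(4, 1)

# Broken: the numpy round-trip cuts the graph, so no gradient reaches the model.
pred = torch.from_numpy(model(x).detach().numpy())
loss = ((pred - target) ** 2).mean()
loss.requires_grad_(True)  # silences the "does not require grad" error
loss.backward()
print(model.weight.grad)   # None: everything before the numpy step was wiped out

# Correct: keep the computation in torch ops end to end.
pred = model(x)
loss = ((pred - target) ** 2).mean()
loss.backward()
print(model.weight.grad is not None)  # True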