transformers: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

System Info

  • transformers version: 4.30.2
  • Platform: Linux-5.15.120+-x86_64-with-glibc2.31
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.0+cpu (False)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

Who can help?

@ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I’m trying to build a sarcasm detector with PyTorch Lightning in this Kaggle notebook.

When I start the training, I get this error: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

This is my LightningModule:

import pytorch_lightning as pl
from torch import nn
from transformers import BertModel, get_linear_schedule_with_warmup
from transformers import AdamW  # presumably transformers' AdamW (see the comments below); swapping to torch.optim.AdamW is the reported fix


class SarcasmTagger(pl.LightningModule):

    def __init__(
        self, 
        model_name: str, 
        n_classes: int, 
        n_training_steps=None, 
        n_warmup_steps=None
    ):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name, return_dict=True)
        #self.bert =  BertForSequenceClassification.from_pretrained(model_name, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)
        self.n_training_steps = n_training_steps
        self.n_warmup_steps = n_warmup_steps

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        #print(outputs)
        logits = self.classifier(outputs.pooler_output)
        return logits
    
    def shared_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        label = batch["label"].view(-1, 1)
        logits = self(input_ids=input_ids, attention_mask=attention_mask)
        loss = nn.functional.cross_entropy(logits, label)
        return logits, loss, label
        

    def training_step(self, batch, batch_idx):
        logits, loss, label = self.shared_step(batch, batch_idx)
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return {"loss": loss, "predictions": logits, "label": label}

    def validation_step(self, batch, batch_idx):
        logits, loss, label = self.shared_step(batch, batch_idx)
        self.log("val_loss", loss, prog_bar=True, logger=True)
        return loss

    def test_step(self, batch, batch_idx):
        logits, loss, label = self.shared_step(batch, batch_idx)
        self.log("test_loss", loss, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=2e-5)

        scheduler = get_linear_schedule_with_warmup(
          optimizer,
          num_warmup_steps=self.n_warmup_steps,
          num_training_steps=self.n_training_steps
        )

        return dict(
            optimizer=optimizer,
            lr_scheduler=dict(
                scheduler=scheduler,
                interval='step')
        )

What is the problem here? I’m lost.

Thanks!

Expected behavior

Execute the training without errors.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (2 by maintainers)

Most upvoted comments

Some more details:

These combinations work:

  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.26.1
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.27.4
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.28.1
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.29.2

These combinations don’t:

  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.30.0
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.30.2
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.31.0

So the regression must have been introduced in transformers==4.30.0?

I’ll try to see if I can get a minimal reproducing script together.

Same problem here; as suggested, it was resolved by switching optimizers.

Hi all, the default has been changed on main and will ship in the next release. Install with pip install git+https://github.com/huggingface/transformers to use it out of the box!

If our AdamW is not working properly, that is all the more reason to switch the default to the PyTorch one. Users will still be able to switch back if they do not like the change.

Not entirely sure this is worth looking into too much, given @stas00’s point here: https://github.com/huggingface/transformers/pull/23417#issuecomment-1550506298

This is a very old and deprecated implementation, since it doesn’t even follow the AdamW algorithm exactly. One should use torch.optim.AdamW instead, which has had a fused version since pt-2.0.0 that is almost as fast as apex’s fused AdamW. So really you shouldn’t be using this version anyway.

The only reason it was kept is for BC for those who rely on exact results remaining exact after new transformers versions are released, otherwise we would have just replaced it with torch.optim.AdamW in the first place.

So yes, AdamW is slated for deprecation and you should use torch.optim.AdamW. @sgugger, do we know when that is going to be? Or should we look into this more?
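
To make that concrete, here is a minimal sketch of using torch.optim.AdamW with its fused implementation; this is not from the issue, and the linear model and hyperparameters are placeholders only:

import torch

# Placeholder model standing in for the issue's SarcasmTagger parameters.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(768, 2).to(device)

# torch.optim.AdamW is the maintained implementation of decoupled weight decay.
# The fused kernel requires CUDA parameters, so enable it only when running on GPU.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, fused=(device == "cuda"))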

There wasn’t anything explicitly changed in AdamW since v4.29.0, so it’ll certainly take some digging to find the exact commit.

Some more details after I swapped this line of code:

from transformers import AdamW

with this line:

from torch.optim import AdamW

Now all the versions of transformers I tested earlier work with my existing codebase:

  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.26.1
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.27.4
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.28.1
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.29.2
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.30.0
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.30.2
  • torch==2.0.0+cu117, pytorch-lightning==1.9.4, accelerate==0.21.0, tokenizers==0.13.3, transformers==4.31.0

Therefore, there is pretty strong evidence that a change to transformers.AdamW in transformers==4.30.0 caused this regression?
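
For reference, applying that swap to configure_optimizers from the original report would look roughly like this; a sketch only, keeping the learning rate and scheduler from the issue, where only the AdamW import changes:

from torch.optim import AdamW  # was: from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

# Drop-in replacement for SarcasmTagger.configure_optimizers from the report.
def configure_optimizers(self):
    optimizer = AdamW(self.parameters(), lr=2e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=self.n_warmup_steps,
        num_training_steps=self.n_training_steps,
    )
    return dict(
        optimizer=optimizer,
        lr_scheduler=dict(scheduler=scheduler, interval="step"),
    )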

Thanks a lot @lcoandrade for that! 🙌 I can now upgrade our transformers dependency to the latest!

I have a similar issue.

With pytorch-lightning==1.9.4 and transformers==4.26.1 the code runs fine (and has done with previous versions of both libraries for months/years - yes there have been code changes in that time but the core has been rather stable).

(Also just tested with transformers==4.29.2 and works fine)

However, when I change nothing in the code and change no other dependencies (so pytorch-lightning==1.9.4 and all others the same) except upgrading to transformers==4.30.2, the code fails with the error message:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

The problem is that my codebase is very large, so it will take me a while to put together a minimal reproducing script. I will work on it, but in the meantime perhaps someone else will have a simpler solution (given the information I am sharing) and/or a simpler minimal reproducing script.

Perhaps also @lcoandrade you could try your script with transformers==4.26.1 or transformers==4.29.2 and see if that works for you?
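
A quick sanity check for this kind of version bisection, offered only as a convenience sketch, is to print the versions that are actually loaded in the environment:

import pytorch_lightning as pl
import torch
import transformers

# Confirm which versions the running interpreter actually picked up.
print("torch:", torch.__version__)
print("pytorch-lightning:", pl.__version__)
print("transformers:", transformers.__version__)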