transformers: CLIP pooling is not compatible with adding new tokens

System Info

Feature request (Duplicate of #21029)

For textual inversion in diffusers, we add tokens whose token ids are higher than the eos token id. When we compute CLIP embeddings for prompts containing textual inversion tokens, the pooling therefore needs to select the eos token position rather than the argmax of the token ids.
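
For context, a minimal sketch of the pooling behavior described here (the function name is illustrative; it mirrors the argmax-based selection used by CLIPTextModel at the time of this issue):

```python
import torch

def pool_by_argmax(last_hidden_state: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # Pick, per sequence, the hidden state at the position of the highest token id.
    # This only coincides with the eos position while eos has the largest id in the vocab.
    batch_indices = torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device)
    return last_hidden_state[batch_indices, input_ids.argmax(dim=-1)]

# Once a token with an id larger than eos_token_id is added (e.g. a textual
# inversion placeholder), argmax points at that token instead of eos.
```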

Motivation

This should be fixed because the pooled CLIP embeddings break as soon as more tokens are added to the tokenizer. It hasn't been a major problem so far because most models use the hidden states directly, but the new SDXL paper also uses the pooled output.

@ArthurZucker @younesbelkada

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Add a new token to the tokenizer
  2. Encode a prompt containing the new token with CLIPTextModel
  3. Get the pooled output (a minimal reproduction sketch follows below)
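
A minimal reproduction sketch of these steps, assuming the openai/clip-vit-base-patch32 checkpoint and a hypothetical `<my-concept>` placeholder token:

```python
from transformers import CLIPTextModel, CLIPTokenizer

model_name = "openai/clip-vit-base-patch32"  # illustrative checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_name)
text_encoder = CLIPTextModel.from_pretrained(model_name)

# 1. Add a new token; it gets an id larger than the eos token id.
tokenizer.add_tokens(["<my-concept>"])  # hypothetical placeholder, as in textual inversion
text_encoder.resize_token_embeddings(len(tokenizer))

# 2. Encode a prompt that contains the new token.
inputs = tokenizer("a photo of <my-concept>", return_tensors="pt")
outputs = text_encoder(**inputs)

# 3. The pooled output is taken at the argmax of the input ids, which is now
#    the position of <my-concept>, not the eos token.
pooled_output = outputs.pooler_output
argmax_position = inputs.input_ids.argmax(dim=-1)
eos_position = (inputs.input_ids == tokenizer.eos_token_id).int().argmax(dim=-1)
print(argmax_position, eos_position)  # differ whenever the added token is in the prompt
```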

Expected behavior

The pooled output should be taken at the eos token position (found via the eos token id) rather than at the argmax of the input ids, so added tokens with higher ids do not shift it.
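
A sketch of what eos-based pooling could look like (the function name is illustrative; the eos_token_id would need to come from the tokenizer, since the config value was incorrect at the time):

```python
import torch

def pool_by_eos(last_hidden_state: torch.Tensor, input_ids: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    # Locate the first eos token in each sequence instead of taking the argmax of
    # the token ids, so tokens added after eos no longer shift the pooled position.
    batch_indices = torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device)
    eos_positions = (input_ids == eos_token_id).int().argmax(dim=-1)
    return last_hidden_state[batch_indices, eos_positions]
```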

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (2 by maintainers)

Most upvoted comments

Even if we don't use the eos_token_id from the config, that doesn't mean nobody else does!

That being said, as @patil-suraj points out, the eos_token_id is wrong. I don't think it could be meaningfully or correctly used anywhere, so I'm happy for it to be updated.

It makes me think we should add some tests to make sure the model and tokenizer mappings are aligned when added to the library - at least as an integration check for an example checkpoint.

Oh yes, the text encoder has the wrong one. Using length - 1 might not work if you are adding more tokens. The easy solution is to expose the eos id in the encoder model so it can be changed from the outside.

FYI: the inconsistency between the config and the tokenizer/processor is (one of) the main reasons we had trouble in pipeline testing (using tiny models). I had to do some extra work to avoid this problem (in the context of creating tiny models).

Thanks a lot for the issue @okaris !

IMO, updating the eos_token_id in the config would be better than adding a new config attribute. As far as I can tell, this should not break anything: the config is never really used for tokenization, and while it is used to get the eos_token_id during generation, the CLIP model is not used for generation. On top of that, the current config.eos_token_id is incorrect anyway, so updating it should be safe. We can send a mass PR on the Hub to update this (cc @patrickvonplaten).
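
For illustration, aligning the config with the tokenizer at load time could look roughly like this (checkpoint name is illustrative; it only has an effect on code paths that actually read config.eos_token_id):

```python
from transformers import CLIPTextModel, CLIPTokenizer

model_name = "openai/clip-vit-base-patch32"  # illustrative checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_name)
text_encoder = CLIPTextModel.from_pretrained(model_name)

# Align the model config with the tokenizer so anything reading
# config.eos_token_id (e.g. eos-based pooling) picks up the correct id.
text_encoder.config.eos_token_id = tokenizer.eos_token_id
```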

What do you think @ydshieh @patrickvonplaten ?

I might be wrong, but isn’t the new length of the (resized) embedding layer just the new vocab_size?
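
A quick check of that, assuming the same illustrative checkpoint and hypothetical placeholder token as above:

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokenizer.add_tokens(["<my-concept>"])  # hypothetical placeholder token
text_encoder.resize_token_embeddings(len(tokenizer))

# The resized embedding matrix and config.vocab_size both match the tokenizer
# length, so "length - 1" points at the last *added* token, not at eos.
assert text_encoder.get_input_embeddings().num_embeddings == len(tokenizer)
assert text_encoder.config.vocab_size == len(tokenizer)
```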

@okaris

If I understand correctly, the goal is to use the (fixed) eos id, rather than using argmax. Is this right?