transformers: CLIP pooling is not compatible with adding new tokens
System Info
Feature request (Duplicate of #21029)
For textual inversion in diffusers, we are adding tokens that have a higher token id than the eos token. So when we get CLIP embeddings for textual inversion tokens, we need to change the pooling so that it picks the eos token position rather than the argmax position.
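For illustration, here is a minimal sketch (with made-up intermediate ids; 49406/49407 are the usual CLIP bos/eos ids and 49408 stands for a newly added token) of why argmax-based pooling picks the wrong position:

```python
import torch

# CLIP pools the hidden state at input_ids.argmax(dim=-1), which coincides with
# the eos position only while eos has the largest id in the sequence.
ids_without_new = torch.tensor([[49406, 320, 1125, 49407]])        # <bos> ... <eos>
ids_with_new = torch.tensor([[49406, 320, 1125, 49408, 49407]])    # ... <new-token> <eos>

print(ids_without_new.argmax(dim=-1))  # tensor([3]) -> eos position (correct)
print(ids_with_new.argmax(dim=-1))     # tensor([3]) -> the added token's position (wrong)
```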
Motivation
This should be fixed, as the pooled CLIP embeddings won't be correct once more tokens are added to the tokenizer. It hasn't been a huge issue so far because most models use the hidden states directly, but the new SDXL paper also mentions using the pooled output.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- Add new token to tokenizer
- Encode tokens with CLIPTextModel
- Get pooled output
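A minimal reproduction sketch of the steps above, assuming the openai/clip-vit-base-patch32 checkpoint and a placeholder token name (`<my-concept>` is just an example):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

# Add a new token (it gets an id larger than the eos id) and resize the embeddings.
tokenizer.add_tokens(["<my-concept>"])
model.resize_token_embeddings(len(tokenizer))

# Encode a prompt that contains the new token.
inputs = tokenizer("a photo of <my-concept>", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# pooler_output is taken at input_ids.argmax(-1), which is now the position of
# <my-concept> instead of the eos token.
pooled = outputs.pooler_output
```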
Expected behavior
The pooled output should be taken at the position of the eos token id, instead of at the argmax of the input ids, so that added tokens with higher ids do not shift the pooling position.
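One possible way to get this behavior is to pool at the first occurrence of the eos id; the helper below is only a sketch (`pool_at_eos` is not an existing transformers function):

```python
import torch

def pool_at_eos(last_hidden_state, input_ids, eos_token_id):
    # Index the hidden state at the first eos token in each sequence,
    # instead of at the position of the largest token id.
    eos_positions = (input_ids == eos_token_id).int().argmax(dim=-1)
    batch_indices = torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device)
    return last_hidden_state[batch_indices, eos_positions]

# With the reproduction above:
# pooled = pool_at_eos(outputs.last_hidden_state, inputs.input_ids, tokenizer.eos_token_id)
```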
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17 (2 by maintainers)
Even if we don't use the `eos_token_id` from the config, that doesn't mean nobody else does! That being said, as @patil-suraj points out, the `eos_token_id` is wrong. I don't think it could be meaningfully or correctly used anywhere, so I'm happy for it to be updated.
It makes me think we should add some tests to make sure the model and tokenizer mappings are aligned when added to the library - at least as an integration check for an example checkpoint.
Oh yes, the text encoder has the wrong one. Length - 1 might not work if you are adding more tokens. The easy solution is to expose the eos id in the encoder model so it can be changed from the outside.
FYI: the inconsistency between the config and the tokenizer/processor is (one of) the main reasons we had trouble in pipeline testing (using tiny models). I had to do some extra work to avoid this problem (in the context of creating tiny models).
Thanks a lot for the issue @okaris!
IMO, updating the `eos_token_id` in the config would be better than adding a new config attribute. As far as I can tell, this should not break anything: the config is never really used for tokenization, it is only used to get the `eos_token_id` when doing generation, but the CLIP model is not used for generation, and the current `config.eos_token_id` is incorrect anyway, so updating it should be safe. We can send a mass PR on the Hub to update this (cc @patrickvonplaten).
What do you think @ydshieh @patrickvonplaten?
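To see the inconsistency being discussed, one can compare the value stored in the model config with the tokenizer's value (the checkpoint name is just an example; at the time of this issue the two values disagreed):

```python
from transformers import CLIPConfig, CLIPTokenizer

config = CLIPConfig.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# If these differ, anything keyed on config.eos_token_id targets the wrong token;
# updating the config on the Hub removes the mismatch.
print(config.text_config.eos_token_id, tokenizer.eos_token_id)
```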
I might be wrong, but isn't the new length of the (resized) embedding layer just the new `vocab_size`?
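A quick check of that, reusing the model and tokenizer from the reproduction sketch above:

```python
# After resize_token_embeddings(len(tokenizer)), the token embedding matrix has
# len(tokenizer) rows, i.e. the new vocab size including the added tokens.
assert model.get_input_embeddings().weight.shape[0] == len(tokenizer)
```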
@okaris If I understand correctly, the goal is to use the (fixed) eos id rather than using `argmax`. Is this right?