attention_sinks: Trying a minimal example with LlamaForCausalLM, sadly it fails
My minimal example:
```python
import torch
from transformers import AutoTokenizer
from attention_sinks import LlamaForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

repo = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = LlamaForCausalLM.from_pretrained(repo, device_map="auto", load_in_4bit=True)

# Set the text you want to generate text based on
# text = "<s> you are hepful assistant. </s> <u> Tell me the pros and cons of coffee. Two points. </u>"
text = "<s> you are hepful assistant. </s> <u> Write me a long essay on the reasons for fall of roman empire/u>"

# Encode the text
input_ids = tokenizer.encode(text, return_tensors="pt").to(device)

# Generate text
generated_tokens = model.generate(input_ids, penalty_alpha=0.6, top_k=5, max_length=4096)

# Decode the generated text
generated_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(generated_text)
```
Fails here:
```
File ~/mambaforge/envs/data_science/lib/python3.10/site-packages/attention_sinks/models/llama/pos_shift.py:103, in llama_pos_shift_attention_forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    101 if attention_mask is not None:
    102     if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
--> 103         raise ValueError(
    104             f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
    105         )
    106     attn_weights = attn_weights + attention_mask
    108 # upcast attention to fp32

ValueError: Attention mask should be of size (1, 1, 1, 1025), but is torch.Size([1, 1, 1, 1026])
```
The root of the issue is clear, but trying dumb fixes (like slicing the attention mask to make it “fit”, see the sketch below) doesn’t work. Is it at least reproducible in your env? 👀 I’d really appreciate any pointers on ways to fix this 🙏
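For context, the kind of slicing I mean looks roughly like this, patched into `llama_pos_shift_attention_forward` right before the size check (a sketch only; it merely hides the shape mismatch, the underlying problem remains):

```python
# sketch of the "slice to fit" hack inside pos_shift.py (not a real fix):
# trim the mask's last dimension to the expected kv_seq_len before the size check
if attention_mask is not None and attention_mask.size(-1) != kv_seq_len:
    attention_mask = attention_mask[..., -kv_seq_len:]
```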
About this issue
- State: closed
- Created 9 months ago
- Comments: 16 (10 by maintainers)
Resolved in 48bb293d4fb15d08bdeb3a0425cee0ea78f8ba52, thanks again for reporting
Just chiming in here to say thank you for all your hard work that makes it easier to experiment with the results of the paper, you rock 🤗
Please try the following snippet with the model of your choice and a corresponding prompt. The generation here is set up to run endlessly, so it may still eventually lose track of what it was doing, but it shouldn’t forget English like what would happen with pure `transformers` or windowed attention. Feel free to experiment with #6 to get `model.generate` working.
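A rough sketch of that kind of endless-generation loop (greedy decoding; the model repo and prompt below are placeholders rather than the exact snippet):

```python
import torch
from transformers import AutoTokenizer
from attention_sinks import LlamaForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)

text = "Write me a long essay on the reasons for the fall of the Roman Empire."
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

past_key_values = None
with torch.no_grad():
    while True:  # run until interrupted; the attention-sink cache keeps memory bounded
        out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        print(tokenizer.decode(next_token[0]), end="", flush=True)
        input_ids = next_token  # feed only the newly generated token back in
```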
I also get failures when calling `generate`, although the model does work if I do the generation manually (see the sketch at the end of this comment). It ends up writing `<u> Write me a long essay on the reasons for fall of roman empire/u>` over and over for thousands of tokens (because this is not an instruct-tuned model; this is how the pure `transformers` model reacts too). I’ll also check what happens if I use the windowed attention approach, i.e. the green line here.

Edit: See these outputs. The left is the index and the right is the output token. It completely loses the plot.
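The manual loop itself is just the core of a sketch like the one above, bounded to a fixed number of new tokens (assumes `model`, `tokenizer`, and `text` are already defined as in the earlier snippets):

```python
import torch

input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
past_key_values = None
generated = []
with torch.no_grad():
    for _ in range(4096):  # fixed budget of new tokens
        out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        generated.append(next_token.item())
        input_ids = next_token  # feed only the new token back in
print(tokenizer.decode(generated))
```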
So, `attention_sinks` does work, but not with `model.generate` at the moment. I’ll have to debug the `generate` method to figure out where the issue originates.

Let me look into this! I haven’t tried to generate myself: I’ve only tried to directly call `forward` on the `LlamaModel`/`FalconModel` in my benchmarks.
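Roughly speaking, those benchmarks feed the model one token at a time through `forward` and track the loss, along these lines (a simplified sketch, not the actual benchmark code; the helper name is made up):

```python
import torch
import torch.nn.functional as F

def mean_nll(model, tokenizer, text):
    """Sketch: feed tokens one at a time through forward() and average the negative log-likelihood."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    past_key_values = None
    nlls = []
    with torch.no_grad():
        for i in range(input_ids.size(1) - 1):
            out = model(
                input_ids=input_ids[:, i : i + 1],  # one token at a time, so the sink cache is exercised
                past_key_values=past_key_values,
                use_cache=True,
            )
            past_key_values = out.past_key_values
            logits = out.logits[:, -1, :]  # prediction for the next token
            target = input_ids[:, i + 1]
            nlls.append(F.cross_entropy(logits, target))
    return torch.stack(nlls).mean()
```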