unlimiformer: TypeError: torch_replacement_knn_gpu() got an unexpected keyword argument 'device'
Hey, looks like I’m having some issues working with Llama models. This is the modified script I’m using:
!python run_generation.py --model_type llama --model_name_or_path psmathur/orca_mini_3b \
--prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
--prompt example_inputs/harry_potter_full.txt \
--suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16 \
--index_devices 1 --datastore_device 0
But I get this error:
2023-08-14 14:28:33.395015: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
08/14/2023 14:28:35 - WARNING - __main__ - device: cuda, n_gpu: 1, 16-bits training: True
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565, and set the legacy attribute accordingly.
Loading checkpoint shards: 100% 3/3 [00:08<00:00, 2.95s/it]
08/14/2023 14:29:16 - INFO - __main__ - Namespace(model_type='llama', model_name_or_path='psmathur/orca_mini_3b', prompt='example_inputs/harry_potter_full.txt', length=200, num_hidden_layers=None, stop_token=None, temperature=1.0, repetition_penalty=1.0, k=0, p=0.9, prefix='<<SYS>>\\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \\n<</SYS>>\\n\\n [INST] Summarize the following book: ', suffix=' [/INST]', padding_text='', xlm_language='', seed=42, no_cuda=False, stream_output=False, num_return_sequences=1, fp16=True, jit=False, device=device(type='cuda'), n_gpu=1)
08/14/2023 14:29:16 - INFO - Unlimiformer - Encoding 0 to 65 out of 65
Traceback (most recent call last):
File "/content/unlimiformer/src/run_generation.py", line 577, in <module>
main()
File "/content/unlimiformer/src/run_generation.py", line 532, in main
output_sequences = model.generate(
File "/content/unlimiformer/src/unlimiformer.py", line 529, in pre_generate_hook
return self.original_generate_func(input_ids_prefix, **new_kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1642, in generate
return self.sample(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2724, in sample
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/unlimiformer/src/unlimiformer.py", line 551, in pre_forward_hook
result = self.original_forward_func(input_ids=input_ids, labels=labels, attention_mask=attention_mask, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 810, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 698, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/unlimiformer/src/unlimiformer.py", line 575, in attention_pre_forward_hook
result = original_cross_attn_forward_func(hidden_states=hidden_states, attention_mask=attention_mask, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 310, in forward
query_states = self.q_proj(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1547, in _call_impl
hook_result = hook(self, args, result)
File "/content/unlimiformer/src/unlimiformer.py", line 629, in attention_forward_hook
_, top_search_key_indices = self.datastore[datastore_index].search(datastore_query, k=topk)
File "/content/unlimiformer/src/index_building.py", line 34, in search
scores, values = self.indices[i].search(queries[i], k)
File "/content/unlimiformer/src/index_building.py", line 144, in search
scores, values = faiss.knn_gpu(faiss.StandardGpuResources(), queries, self.keys, k,
TypeError: torch_replacement_knn_gpu() got an unexpected keyword argument 'device'
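For what it’s worth, here’s a quick sanity check (a sketch; it assumes the torch wrapper lives at faiss.contrib.torch_utils.torch_replacement_knn_gpu, which is where the traceback points): print the faiss version and the wrapper’s signature. On older faiss-gpu builds the signature has no device parameter, which would explain the TypeError:
import inspect
import faiss
import faiss.contrib.torch_utils  # importing this patches faiss.knn_gpu to accept torch tensors
print(faiss.__version__)
print(inspect.signature(faiss.contrib.torch_utils.torch_replacement_knn_gpu))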
Any ideas on how to fix that?
Thanks again for all the help and for the new features!
Here’s the command for the 80k-token input:
And surprisingly, I modified the instruction in the prefix a little and tested again with a 135k-token input. Here’s the result:
Amazing! And here’s the command for the above:
But it’s weird that the model’s output for summarizing Harry Potter is still strange, even though it uses the same flags as the cookbook above:
(Yes, the model indeed outputs a lot of \n; maybe it’s Harry’s magic, I guess…)

Ok, finally got it to work in Google Colab on an A100 40G! For anyone curious, I used StableBeluga-13B, and it took around 9 minutes to get a summary of Harry Potter, which is pretty good, especially since you can’t even fit the full book into Claude 100k! I’m thoroughly impressed!
Here is the code I used to get it working in Colab:
First, in order to get the latest version of faiss, you have to upgrade Python to 3.10, since it’s automatically set to 3.7.
Then you’ll want to install Miniconda so that you can install faiss with conda.
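One way to do both steps (a sketch rather than my exact cell; the installer filename here is an assumption, so check https://repo.anaconda.com/miniconda for the current py310 build):
!wget -q https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O /tmp/miniconda.sh
# -b: batch mode (no prompts), -f: allow an existing prefix, -p: install into /usr/local so Colab finds conda
!bash /tmp/miniconda.sh -b -f -p /usr/local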
Install faiss-gpu from the conda-forge channel:
!conda install -c conda-forge faiss-gpu -y
Then clone the repo in Colab:
!git clone https://github.com/abertsch72/unlimiformer.git
Install the requirements. (In this instance some of them aren’t required, but I liked to have them just in case.)
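Something like this should do it (assuming the repo keeps a requirements.txt at its root):
!pip install -r /content/unlimiformer/requirements.txt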
cd into the src folder in unlimiformer:
%cd /content/unlimiformer/src
Then you should be good to run the script! Just be sure that --index_devices and --datastore_device are set correctly; in my case I set them both to 0.
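Here’s a hedged reconstruction of the full command (the StableBeluga-13B Hub path is a guess on my part; the other flags mirror the script at the top of this issue):
!python run_generation.py --model_type llama --model_name_or_path stabilityai/StableBeluga-13B \
--prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
--prompt example_inputs/harry_potter_full.txt \
--suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 22 \
--index_devices 0 --datastore_device 0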
This worked pretty well after I set --layer_begin to 22 (a little over half the number of layers in the model). Here’s the summary:
Harry Potter is a young boy who discovers he is a wizard, invited to attend the Hogwarts School of Witchcraft and Wizardry. He embarks on an adventure with his friends Ronald Weasley and Hermione Granger to face various challenges and enemies such as Voldemort and Lord Voldemort's supporters. Their journey involves discovering their true identities, unraveling mysteries, and learning valuable lessons about friendship, courage, and the fight against evil.</s>
Thanks again for all the hard work you and your team did, @urialon! I’m pretty hyped about this!