unlimiformer: TypeError: torch_replacement_knn_gpu() got an unexpected keyword argument 'device'

Hey, it looks like I’m having some issues working with Llama models. This is the modified script I’m using:

!python run_generation.py --model_type llama --model_name_or_path psmathur/orca_mini_3b \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16 \
    --index_devices 1 --datastore_device 0

But I get this error:

2023-08-14 14:28:33.395015: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
08/14/2023 14:28:35 - WARNING - __main__ - device: cuda, n_gpu: 1, 16-bits training: True
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565, and set the legacy attribute accordingly.
Loading checkpoint shards: 100% 3/3 [00:08<00:00,  2.95s/it]
08/14/2023 14:29:16 - INFO - __main__ - Namespace(model_type='llama', model_name_or_path='psmathur/orca_mini_3b', prompt='example_inputs/harry_potter_full.txt', length=200, num_hidden_layers=None, stop_token=None, temperature=1.0, repetition_penalty=1.0, k=0, p=0.9, prefix='<<SYS>>\\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \\n<</SYS>>\\n\\n [INST] Summarize the following book: ', suffix=' [/INST]', padding_text='', xlm_language='', seed=42, no_cuda=False, stream_output=False, num_return_sequences=1, fp16=True, jit=False, device=device(type='cuda'), n_gpu=1)
08/14/2023 14:29:16 - INFO - Unlimiformer - Encoding 0 to 65 out of 65
Traceback (most recent call last):
  File "/content/unlimiformer/src/run_generation.py", line 577, in <module>
    main()
  File "/content/unlimiformer/src/run_generation.py", line 532, in main
    output_sequences = model.generate(
  File "/content/unlimiformer/src/unlimiformer.py", line 529, in pre_generate_hook
    return self.original_generate_func(input_ids_prefix, **new_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/unlimiformer/src/unlimiformer.py", line 551, in pre_forward_hook
    result = self.original_forward_func(input_ids=input_ids, labels=labels, attention_mask=attention_mask, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 810, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 698, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/unlimiformer/src/unlimiformer.py", line 575, in attention_pre_forward_hook
    result = original_cross_attn_forward_func(hidden_states=hidden_states, attention_mask=attention_mask, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 310, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1547, in _call_impl
    hook_result = hook(self, args, result)
  File "/content/unlimiformer/src/unlimiformer.py", line 629, in attention_forward_hook
    _, top_search_key_indices = self.datastore[datastore_index].search(datastore_query, k=topk)
  File "/content/unlimiformer/src/index_building.py", line 34, in search
    scores, values = self.indices[i].search(queries[i], k)
  File "/content/unlimiformer/src/index_building.py", line 144, in search
    scores, values = faiss.knn_gpu(faiss.StandardGpuResources(), queries, self.keys, k, 
TypeError: torch_replacement_knn_gpu() got an unexpected keyword argument 'device'

Any ideas on how to fix that?
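
In case it helps with debugging: here is a quick probe I put together (my own sketch, not from the repo) to check whether the installed faiss build’s torch bridge even accepts the device keyword that unlimiformer passes:

import inspect
# Importing this module registers faiss's torch replacement functions,
# including the torch_replacement_knn_gpu named in the traceback.
from faiss.contrib import torch_utils

# Older faiss builds lack the device parameter; that is exactly what
# triggers the TypeError above.
sig = inspect.signature(torch_utils.torch_replacement_knn_gpu)
print('device' in sig.parameters)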

Thanks again for all the help and for the new features!

Most upvoted comments

Just for future reference, what command line did you use to generate the last output? (I am curious about the exact model, the exact layer_begin, and the exact prompt.)

Best, Uri

Here’s the command for the 80k-token input:

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/1.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 20 \
    --index_devices 1 --datastore_device 2 --stream_output

And surprisingly, I then modified the instruction a little in the prefix and tested again with a 135k-token input. Here’s the result:

=== GENERATED SEQUENCE 1 (input length: 135830) ===
||| This is a cookbook with recipes for various types of salads. The book is divided into seasons, with a chapter for each season, and each chapter contains a variety of salad recipes, each with a brief description of the dish and an explanation of how to prepare it. Each recipe is accompanied by notes and suggestions for complementary dishes, and wine or beer pairings.

* Winter: Three-Alarm Salad, Galette of Greens and Goat cheese, with Braised Mushrooms and Mustard, Bitter greens, carrot and Pickled Zucchini.
* Spring: Sautéed Chicken, Kitchen Garden Greens, Chèvre, and Chive. Spring Onion Tarts, Toppings, 45
* Summer: Zucchettis, Cauliflower, Ratatelli, Chiveroli, and Chilled, and Lemon Dress.

Amazing! And the command for above:

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the content of the following book: " \
    --prompt example_inputs/1.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 20 \
    --index_devices 1 --datastore_device 2 --stream_output

But it’s weird that the model’s output for summarizing Harry Potter is still strange, even though I used the same flags as for the cookbook above:

=== GENERATED SEQUENCE 1 (input length: 131318) ===
||| Harry Potter and the Philosopher's Stone, J.K. Rowling.


He had been famous -- Harry -- since he'd become the
magic's hero... He'd been famous in front of the Muggles, too. 
It was a weird feeling, famous -- being a footloos an' talked to'...' 



They'd worked out how to get past Fluffy without trying

"...Midgit... it was Death E ly... Theoretical..."


















































































(Yes, the model really did output a lot of \n; maybe it’s Harry’s magic, I guess…)

OK, I finally got it to work in Google Colab on an A100 40G. For anyone curious, I used StableBeluga-13B and it took around 9 minutes to get a summary of Harry Potter, which is pretty good, especially since you can’t even fit the full book into Claude’s 100k context! I’m thoroughly impressed!

Here is the code I used to get it working in Colab:

First, in order to get the latest version of faiss, you have to upgrade Python to 3.10, since Colab automatically sets it to 3.7:

!wget https://github.com/korakot/kora/releases/download/v0.10/py310.sh
!bash ./py310.sh -b -f -p /usr/local
!python -m ipykernel install --name "py310" --user
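
You can confirm the interpreter actually switched (just a quick check):

import sys
# Should report 3.10.x after the upgrade script above has run
print(sys.version)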

Then you’ll want to install Miniconda so that you can install faiss with conda.

################################################################################
# INSTALL CONDA ON GOOGLE COLAB
################################################################################
! wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
! chmod +x Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
! bash ./Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.10/site-packages/')

Install faiss-gpu from conda-forge:

!conda install -c conda-forge faiss-gpu -y
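
Before moving on, it’s worth sanity-checking the install (a minimal check; faiss exposes both a version string and a GPU count helper):

import faiss
# The device keyword on the torch knn_gpu wrapper only exists in recent
# faiss releases, so check that conda pulled a new enough build.
print(faiss.__version__)
# GPU builds expose this helper; expect >= 1 on a GPU runtime.
print(faiss.get_num_gpus())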

Then clone the repo in Colab:

!git clone https://github.com/abertsch72/unlimiformer.git

Install the requirements. (Some of these aren’t strictly required in this instance, but I liked having them just in case.)

%pip install -r requirements.txt
%pip install -q -U bitsandbytes
%pip install -q -U git+https://github.com/huggingface/transformers.git
%pip install -q -U git+https://github.com/huggingface/peft.git
%pip install -q -U git+https://github.com/huggingface/accelerate.git
%pip install -q datasets
%pip install tensorrt

cd into the src folder of the unlimiformer repo:

%cd /content/unlimiformer/src

Then you should be good to run the script! Just be sure that --index_devices and --datastore_device are set correctly; in my case I set them both to 0.
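
If you aren’t sure which device indices exist on your runtime, here’s a quick way to list them (a small sketch using torch):

import torch
# Lists the CUDA devices you can pass to --index_devices / --datastore_device
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))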

!python run_generation.py --model_type llama --model_name_or_path stabilityai/StableBeluga-13B \
    --prefix "### System:\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n### User:\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " ### Assistant" --test_unlimiformer --fp16 --length 200 --layer_begin 22 \
    --index_devices 0 --datastore_device 0

This worked pretty well after I set --layer_begin to 22 (a little over half the number of layers in the model). Here’s the summary:

Harry Potter is a young boy who discovers he is a wizard, invited to attend the Hogwarts School of Witchcraft and Wizardry. He embarks on an adventure with his friends Ronald Weasley and Hermione Granger to face various challenges and enemies such as Voldemort and Lord Voldemort's supporters. Their journey involves discovering their true identities, unraveling mysteries, and learning valuable lessons about friendship, courage, and the fight against evil.</s>
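
For anyone adapting --layer_begin to a different model: you can read the layer count from the model config (a minimal sketch using transformers’ AutoConfig; picking a value a bit past halfway is just my reading of this thread, not an official rule):

from transformers import AutoConfig

# StableBeluga-13B follows the Llama-2-13B architecture, which has 40
# hidden layers, so --layer_begin 22 lands a little past the halfway point.
cfg = AutoConfig.from_pretrained("stabilityai/StableBeluga-13B")
print(cfg.num_hidden_layers)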

Thanks again for all the hard work you and your team did, @urialon! I’m pretty hyped about this!