exllama: Illegal memory access when using a lora
Getting this on inference when I have a LoRA loaded (loading the LoRA itself doesn't produce any errors).
Using text-generation-webui.
File "/home/user/text-generation-webui/modules/models.py", line 309, in clear_torch_cache torch.cuda.empty_cache() File "/home/user/.local/lib/python3.10/site-packages/torch/cuda/memory.py", line 133, in empty_cache torch._C._cuda_emptyCache() RuntimeError: CUDA error: an illegal memory access was encountered Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I just trained this with qlora. Unfortunately I can't use the Transformers loader, because it takes between 15 and 45 minutes to load a LoRA (not exaggerating, I just waited 45 minutes for the last one to load before giving up) and I can't find any reports of the same issue. So I'm trying to load it with exllama on top of a GPTQ version of llama-2-70b. I'm not even sure that's possible, but previous LoRAs I've trained with other libraries have worked fine on llama 1 GPTQ.
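In exllama API terms, what I'm attempting looks roughly like this (a sketch from memory of the repo's example_lora.py, run from inside the exllama checkout so the module imports resolve; all paths and filenames are placeholders):

```python
# Sketch: load a LoRA on top of a GPTQ model with exllama directly.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
from lora import ExLlamaLora

model_dir = "/models/llama-2-70b-gptq"   # placeholder GPTQ model dir
lora_dir = "/loras/my-qlora-adapter"     # placeholder adapter dir

config = ExLlamaConfig(f"{model_dir}/config.json")
config.model_path = f"{model_dir}/model.safetensors"

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(f"{model_dir}/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

# The LoRA stays separate from the base weights and is attached to the generator.
lora = ExLlamaLora(model, f"{lora_dir}/adapter_config.json",
                   f"{lora_dir}/adapter_model.bin")
generator.lora = lora

print(generator.generate_simple("Hello,", max_new_tokens=20))
```

As far as I can tell, the webui's ExLlama loader does essentially the same thing under the hood, just wired into its model and LoRA dropdowns.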
I don't think I'm out of VRAM; this is failing at a context size of maybe 20 tokens, and I'm on an A6000 (single GPU, nothing fancy). With Transformers I can go up to at least 3000 tokens of context, when I'm patient enough to wait the half hour or whatever it takes to load, and there are no problems once it loads.
Possibly relevant args from my qlora training:
```
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bf16 \
--bits 4 \
--lora_dropout 0.1
```
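For clarity, my reading of what those flags map to in peft/bitsandbytes terms (a sketch only, not the exact code the qlora script runs):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # --bits 4
    bnb_4bit_use_double_quant=True,         # --double_quant
    bnb_4bit_quant_type="nf4",              # --quant_type nf4
    bnb_4bit_compute_dtype=torch.bfloat16,  # --bf16
)

lora_config = LoraConfig(
    r=64,                                   # --lora_r 64
    lora_alpha=16,                          # --lora_alpha 16
    lora_dropout=0.1,                       # --lora_dropout 0.1
    bias="none",
    task_type="CAUSAL_LM",
    # --lora_modules all ends up targeting every linear projection,
    # which matches the adapter_config.json below.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```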
My adapter_config.json if it’s relevant:
{ "auto_mapping": null, "base_model_name_or_path": "meta-llama/Llama-2-70b-hf", "bias": "none", "fan_in_fan_out": false, "inference_mode": true, "init_lora_weights": true, "layers_pattern": null, "layers_to_transform": null, "lora_alpha": 16.0, "lora_dropout": 0.1, "modules_to_save": null, "peft_type": "LORA", "r": 64, "revision": null, "target_modules": [ "v_proj", "gate_proj", "k_proj", "down_proj", "up_proj", "o_proj", "q_proj" ], "task_type": "CAUSAL_LM"
This is the file structure of the LoRA I have, not sure if that's relevant either:
For those who struggle with this error in text-generation-webui and couldn't figure out from this thread how to switch off fused_attn (like me): switch to the ExLlama_HF model loader and uncomment the following line in modules/exllama_hf.py:
```python
config.fused_attn = False
```
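If you're driving exllama directly rather than through the webui, the same switch should be the fused_attn flag on ExLlamaConfig, set before the model is built. A minimal sketch, assuming the standard exllama module layout and placeholder paths:

```python
# Same workaround outside the webui: turn off fused attention on the
# ExLlamaConfig before constructing the model.
from model import ExLlama, ExLlamaConfig

config = ExLlamaConfig("/models/llama-2-70b-gptq/config.json")
config.model_path = "/models/llama-2-70b-gptq/model.safetensors"
config.fused_attn = False   # the same flag the webui line above sets
model = ExLlama(config)
```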
Hmm, I remember doing some napkin math when someone asked if 70B would fit in 40GB, and my estimate was that it would probably just squeeze into 40GB on a single card at not-quite-full context, and probably wouldn't fit across two cards totaling exactly 40GB (e.g. 24+16) once the overhead of the extra card is factored in. So running 70B at all is already a really tight squeeze.
My understanding is that ExLlama will keep a loaded LoRA in VRAM separately from the base model weights, and the LoRA weights are read as needed when the generator is triggered, which allows you to swap out LoRAs as often as you’d like without having to reload the entire model. I haven’t looked, but Transformers might just plaster the LoRA weights on top of the model weights in VRAM, which would leave that memory open for context instead.
Can’t really think of a solution that isn’t annoying; you’d either want a little more VRAM (like literally 2GB more), or a smaller LoRA, or the LoRA premerged onto the 70B weights, or a way to irreversibly merge the weights in memory with ExLlama.
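To put very rough numbers on that (my ballpark only, using the published Llama-2-70B shapes: 80 layers, hidden size 8192, intermediate size 28672, 8 KV heads of dim 128, and about 4.15 bits per weight for a 128-group GPTQ quant):

```python
# Ballpark VRAM for 4-bit 70B plus a rank-64 all-modules LoRA (estimates only).
GB = 1024**3

# Quantized weights: ~70e9 params at ~4.15 bits/param for a 128-group GPTQ quant.
weights = 70e9 * 4.15 / 8 / GB                         # ~33.8 GB

# FP16 KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes.
kv_per_token = 2 * 80 * 8 * 128 * 2
kv_cache = kv_per_token * 4096 / GB                    # ~1.3 GB at full 4k context

# Rank-64 LoRA on q/k/v/o/gate/up/down kept separately in FP16:
# params per layer = r * (in_features + out_features) summed over the projections.
lora_params_per_layer = 64 * ((8192 + 8192)            # q_proj
                              + (8192 + 1024) * 2      # k_proj, v_proj
                              + (8192 + 8192)          # o_proj
                              + (8192 + 28672) * 2     # gate_proj, up_proj
                              + (28672 + 8192))        # down_proj
lora = lora_params_per_layer * 80 * 2 / GB             # ~1.5 GB

print(f"weights ~{weights:.1f} GB, kv ~{kv_cache:.1f} GB, lora ~{lora:.1f} GB")
```

The rank-64 all-modules LoRA alone works out to roughly 1.5GB in FP16 on top of everything else, which is about the margin that's missing here.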
Is the 70B GPTQ quant you're using one with a group size? Going from a 128-group-size quant to an ungrouped one would save around 1.4GB. If it's already ungrouped, though, that's as small as ExLlama currently supports.