peft: ESM-2 + QLoRA: high memory usage
System Info
peft==0.5.0
accelerate==0.23.0
transformers==4.34.0
torch==2.0.1+cuda117
bitsandbytes==0.41.1
Ubuntu 22.04 and Windows 10, Python 3.10
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder
- My own task or dataset (give details below)
Reproduction
For LoRA:

import torch
import torch.nn as nn
from transformers import EsmModel
from peft import LoraConfig, get_peft_model

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
        config = LoraConfig(
            r=8,
            lora_alpha=32,
            target_modules=["query", "key", "value", "dense"],
            inference_mode=False,
            lora_dropout=0.05,
            bias="none",
        )
        self.model = get_peft_model(self.model, config)
        # The pooler output is not used in forward(), so keep it frozen.
        for param in self.model.pooler.parameters():
            param.requires_grad = False
        self.pooling_layer = nn.AdaptiveAvgPool1d(output_size=1)
        self.head = nn.Linear(self.model.embeddings.position_embeddings.embedding_dim, 5)

    def forward(self, x):
        # Mean-pool the last hidden states over the sequence dimension, then classify.
        features = self.model(input_ids=x["input_ids"], attention_mask=x["attention_mask"])
        transposed_features = features.last_hidden_state.transpose(1, 2)
        pooled_features = self.pooling_layer(transposed_features).squeeze(2)
        logits = self.head(pooled_features)
        return logits
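For context, a minimal sketch of how the encoder is used for a forward pass; the tokenizer setup and the toy sequence below are illustrative assumptions, not my actual data pipeline:

from transformers import AutoTokenizer

# Illustrative only: tokenize a toy protein sequence and run one forward pass.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
encoder = Encoder()

batch = tokenizer(
    ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"],  # dummy sequence, not from the real dataset
    padding="max_length",
    truncation=True,
    max_length=1024,
    return_tensors="pt",
)
logits = encoder({"input_ids": batch["input_ids"], "attention_mask": batch["attention_mask"]})
print(logits.shape)  # torch.Size([1, 5]) for the 5-class head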
For QLoRA:

import torch
import torch.nn as nn
from transformers import BitsAndBytesConfig, EsmModel
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.float16,
        )
        # load_in_4bit is already set inside the BitsAndBytesConfig, so it is not
        # passed to from_pretrained a second time.
        self.model = EsmModel.from_pretrained(
            "facebook/esm2_t6_8M_UR50D",
            quantization_config=quantization_config,
        )
        self.model = prepare_model_for_kbit_training(self.model, use_gradient_checkpointing=False)
        config = LoraConfig(
            r=8,
            lora_alpha=32,
            target_modules=["query", "key", "value", "dense"],
            inference_mode=False,
            lora_dropout=0.05,
            bias="none",
        )
        self.model = get_peft_model(self.model, config)
        # The pooler output is not used in forward(), so keep it frozen.
        for param in self.model.pooler.parameters():
            param.requires_grad = False
        self.pooling_layer = nn.AdaptiveAvgPool1d(output_size=1)
        self.head = nn.Linear(self.model.embeddings.position_embeddings.embedding_dim, 5)

    def forward(self, x):
        # Mean-pool the last hidden states over the sequence dimension, then classify.
        features = self.model(input_ids=x["input_ids"], attention_mask=x["attention_mask"])
        transposed_features = features.last_hidden_state.transpose(1, 2)
        pooled_features = self.pooling_layer(transposed_features).squeeze(2)
        logits = self.head(pooled_features)
        return logits
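To sanity-check the model size after quantization, something along these lines can be used (a sketch; get_memory_footprint comes from transformers' PreTrainedModel and is reached here through the PEFT wrapper's attribute forwarding, which I am assuming works for this setup):

import torch

# Rough sanity check of where the memory goes right after building the model.
encoder = Encoder()

# Parameter/buffer size of the wrapped base model as reported by transformers.
print(f"base model footprint: {encoder.model.get_memory_footprint() / 1024**2:.1f} MiB")

# The 4-bit weights are already placed on the GPU by from_pretrained, so the CUDA
# caching allocator reflects them at this point (before any forward pass).
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")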
I use this function to print the number of parameters:
def print_trainable_parameters(model, logging):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    logging.info(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
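A minimal usage sketch (the basicConfig setup is illustrative; the standard-library logging module is what gets passed as the second argument):

import logging

logging.basicConfig(level=logging.INFO)

# The whole wrapper is passed so the LoRA adapters, the frozen base model,
# the pooling layer, and the classification head are all counted.
encoder = Encoder()
print_trainable_parameters(encoder, logging)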
Expected behavior
When fine-tuning a model with the LoRA method, I expect the VRAM usage to be roughly similar to fine-tuning only the last layer. However, I've observed some discrepancies.
Setup: Task: classification, Batch size: 4, Optimizer: AdamW

Fine-tuning - Last Layer:
- Trainable Parameters: 1,232,960
- Total Parameters: 7,840,121
- Percentage of Trainable Parameters: 15.73%
- VRAM Usage: 2.7 GB

Fine-tuning - Using LoRA:
- Trainable Parameters: 276,480
- Total Parameters: 8,121,721
- Percentage of Trainable Parameters: 3.40%
- VRAM Usage: 5.5 GB

Fine-tuning - Using QLoRA:
- Trainable Parameters: 276,480
- Total Parameters: 4,384,121
- Percentage of Trainable Parameters: 6.31%
- VRAM Usage: 6.1 GB
Observation
When using LoRA for fine-tuning, the VRAM usage is unexpectedly higher than when fine-tuning only the last layer, even though a smaller percentage of the parameters is trainable. The difference is even more pronounced with larger models such as ESM-2 650M. With QLoRA, the VRAM usage is higher still.
Tested Hardware
I confirmed this behavior across different GPUs: A100 80GB, Titan V, and RTX 2070.
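As a back-of-envelope check of why I expected LoRA to be cheaper (a rough sketch assuming fp32 parameters, gradients, and the two AdamW state tensors, i.e. about 16 bytes per trainable parameter, and ignoring activations):

# ~16 bytes per trainable parameter: 4 B weight + 4 B gradient + 8 B AdamW state (fp32).
trainable_lora = 276_480
trainable_last_layer = 1_232_960

print(f"LoRA trainable-state memory:       {trainable_lora * 16 / 1024**2:.1f} MiB")        # ~4.2 MiB
print(f"Last-layer trainable-state memory: {trainable_last_layer * 16 / 1024**2:.1f} MiB")  # ~18.8 MiB

Both estimates are tiny compared to the multi-GB gap I observe, so presumably the difference comes from somewhere else (activations, allocator behavior, etc.).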
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 18 (5 by maintainers)
I ran some more experiments. For this, I created a script (attached, rename .txt to .py) with 5 settings:
I recorded the allocated and reserved memory for each. Here are the results:
Interestingly, when it comes to allocated memory, LoRA uses less than full fine-tuning and a comparable amount to fine-tuning only the last layer. However, when it comes to reserved memory, LoRA seems to use more.
I'm not familiar with the intricacies of how PyTorch decides how much memory to reserve, and maybe I'm measuring things wrong. But this seems to indicate that the theoretical memory savings provided by LoRA are not realized for this model for some reason.
issue-1023.txt
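The measurement idea is roughly the following (a simplified sketch, not the exact code from the attached script):

import torch

def measure_step_memory(model, batch, optimizer):
    """Run one training step and report peak allocated vs. reserved CUDA memory."""
    torch.cuda.reset_peak_memory_stats()

    logits = model(batch)
    loss = logits.float().mean()  # placeholder loss, just to trigger a backward pass
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    allocated = torch.cuda.max_memory_allocated() / 1024**2
    reserved = torch.cuda.max_memory_reserved() / 1024**2
    print(f"peak allocated: {allocated:.1f} MiB || peak reserved: {reserved:.1f} MiB")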
@BenjaminBossan Sorry for the late reply.
This is exactly my training loop:
I am monitoring VRAM using nvidia-smi on Ubuntu and Task Manager on Windows 10. Also, I use a sequence length of 1024 tokens as input. Has anyone tested QLoRA fine-tuning for ESM-2 to make sure it works well?
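Since nvidia-smi shows the CUDA context plus everything the caching allocator has reserved (not just the memory actually held by tensors), I can also cross-check with PyTorch's own allocator statistics from inside the training loop; a minimal sketch:

import torch

# Dump the caching allocator's view of memory usage for the current device.
# nvidia-smi typically reports more than "allocated" because it also includes
# the CUDA context and the allocator's reserved-but-unused pool.
print(torch.cuda.memory_summary(abbreviated=True))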