transformers: "RuntimeError: 'weight' must be 2-D" training with DeepSpeed
System Info
- transformers version: 4.30.2
- Platform: Linux-5.19.0-46-generic-x86_64-with-glibc2.35
- Python version: 3.10.11
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The dataset is my own: just a few hundred strings in a CSV file produced with pandas.
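For illustration, a hypothetical snippet of the kind of code that could produce such a file (the column name "text" and the filename train_data.csv come from the script below; the example strings are placeholders):

import pandas as pd

# Placeholder rows standing in for the few hundred real training strings.
df = pd.DataFrame({"text": ["first example string", "second example string"]})
df.to_csv("train_data.csv", index=False)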
Running the following code
from transformers import GPTJForCausalLM, AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling
import os
from torch.utils.data import Dataset
import pandas as pd
import evaluate
import numpy as np
import sklearn
import torch as nn
from transformers.trainer_pt_utils import get_parameter_names
model_name = "EleutherAI/gpt-j-6b"
d_type = "auto"
print("CUDA Available: "+ str(nn.cuda.is_available()))
print("CUDA Version: " + str(nn.version.cuda))
print("GPUs Available: "+ str(nn.cuda.device_count()))
def process_csv(filename, tknizer):
    data = pd.read_csv(filename)
    return tknizer(list(data["text"].values.flatten()), padding=True, truncation=True, return_tensors="pt")
tokenizer = AutoTokenizer.from_pretrained(model_name, torch_dtype=d_type)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
tokenizer.pad_token = tokenizer.eos_token
class MyDataset(Dataset):
    def __init__(self, tokenized_input):
        self.tokenized_input = tokenized_input

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.tokenized_input.items()}

    def __len__(self):
        return len(self.tokenized_input.input_ids)
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
train_data = MyDataset(process_csv("train_data.csv", tokenizer))
eval_data = MyDataset(process_csv("test_data.csv", tokenizer))
training_args = TrainingArguments(
    output_dir="test_trainer",
    deepspeed="deepSpeedCPU.json",
)
model = GPTJForCausalLM.from_pretrained(model_name, torch_dtype=d_type).cuda()
print("Total Memory: " + str(nn.cuda.get_device_properties(0).total_memory))
print("Reserved: " + str(nn.cuda.memory_reserved(0)))
print("Allocated: " + str(nn.cuda.memory_allocated(0)))
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    data_collator=collator,
    compute_metrics=compute_metrics,
)
trainer.train()
using the following config file
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
causes an error at trainer.train():
Traceback (most recent call last):
  File "/home/augustus/ADAM/main2.py", line 82, in <module>
    trainer.train()
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/trainer.py", line 2784, in compute_loss
    outputs = model(**inputs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 854, in forward
    transformer_outputs = self.transformer(
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 634, in forward
    inputs_embeds = self.wte(input_ids)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
Expected behavior
I would expect training to begin, or at least a more verbose error that would help me fix the issue (if that is possible from my side).
About this issue
- State: closed
- Created a year ago
- Comments: 19 (3 by maintainers)
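For background: under ZeRO stage 3, DeepSpeed partitions every parameter across ranks and replaces each one locally with a small placeholder tensor, so the wte embedding weight that reaches F.embedding is not 2-D unless it has been gathered first, which is typically why this error shows up when a ZeRO-3 config is active but the script is not started through a distributed launcher. A minimal sketch of the effect, assuming the model object from the script above has been initialized under ZeRO-3:

import deepspeed

# Under ZeRO-3 a partitioned parameter is only a placeholder locally;
# its full 2-D shape is visible only while it is explicitly gathered.
w = model.transformer.wte.weight
print(w.shape)  # placeholder shape, not the full (vocab_size, n_embd) matrix
with deepspeed.zero.GatheredParameters([w]):
    print(w.shape)  # full 2-D embedding weight inside the gather context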
Thought so, please use a distributed launcher such as torchrun, deepspeed, or accelerate when using DeepSpeed/DDP/FSDP, or any time you are doing distributed training. Please refer:
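For example, assuming a single-GPU machine and the file names used above (main2.py from the traceback, deepSpeedCPU.json from TrainingArguments), the script could be launched as:

# DeepSpeed launcher; the config is already passed through
# TrainingArguments(deepspeed="deepSpeedCPU.json"), so no extra flag is needed
deepspeed --num_gpus=1 main2.py

# or, equivalently, with torchrun
torchrun --nproc_per_node=1 main2.py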