trl: Failed to load data in trl 0.7.8/0.7.9.
This is a regression introduced in trl 0.7.8 (and still present in 0.7.9); 0.7.7 is fine.
We run into ValueError: too many dimensions 'str' when loading data into the trainer.
Here is a simple Llama-2 + LoRA fine-tuning run on the IMDB dataset as a minimal repro:
#!/usr/bin/env python3
import datasets
import peft
import transformers
import trl
model_dir = "models/Llama-2-7b-hf"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = transformers.AutoModelForCausalLM.from_pretrained(model_dir)
ds_train = datasets.load_dataset("imdb", split="train[:10]")
trainer = trl.SFTTrainer(
    model=model,
    args=transformers.TrainingArguments(
        output_dir="output",
        max_steps=1,
        remove_unused_columns=False,
    ),
    peft_config=peft.LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM",
    ),
    train_dataset=ds_train,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=8,
)
trainer.train()
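For context (not part of the original report): with dataset_text_field="text", SFTTrainer tokenizes the raw text column itself before handing batches to the collator. The sketch below is a rough approximation of that preparation step, not trl's exact internals:

import datasets
import transformers

tok = transformers.AutoTokenizer.from_pretrained("models/Llama-2-7b-hf")
tok.pad_token = tok.eos_token

ds = datasets.load_dataset("imdb", split="train[:10]")
# Tokenize the raw "text" column into input_ids (mirroring max_seq_length=8
# above) and drop the string column so only tensor-friendly features remain.
ds_tok = ds.map(
    lambda batch: tok(batch["text"], truncation=True, max_length=8),
    batched=True,
    remove_columns=ds.column_names,
)
print(ds_tok[0])  # {'input_ids': [...], 'attention_mask': [...]}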
0.7.7 works:
# CUDA_VISIBLE_DEVICES=0 ./test.py
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (4.0.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.52s/it]
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
0%| | 0/1 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.24it/s]Attempted to log scalar metric train_runtime:
0.8097
Attempted to log scalar metric train_samples_per_second:
9.88
Attempted to log scalar metric train_steps_per_second:
1.235
Attempted to log scalar metric total_flos:
2538830561280.0
Attempted to log scalar metric train_loss:
4.124451637268066
Attempted to log scalar metric epoch:
0.5
{'train_runtime': 0.8097, 'train_samples_per_second': 9.88, 'train_steps_per_second': 1.235, 'train_loss': 4.124451637268066, 'epoch': 0.5}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.24it/s]
0.7.8 fails:
# CUDA_VISIBLE_DEVICES=0 ./test.py
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (4.0.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.54s/it]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 940.17 examples/s]
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
0%| | 0/1 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 748, in convert_to_tensors
    tensor = as_tensor(value)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 720, in as_tensor
    return torch.tensor(value)
ValueError: too many dimensions 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./test.py", line 38, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 317, in train
    output = super().train(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/transformers/data/data_collator.py", line 45, in __call__
    return self.torch_call(features)
  File "/usr/local/lib/python3.10/dist-packages/transformers/data/data_collator.py", line 732, in torch_call
    batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3299, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 223, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 764, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`text` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
0%| | 0/1 [00:00<?, ?it/s]
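The traceback shows the data collator calling tokenizer.pad on batch entries that still contain the raw `text` strings (kept because of remove_unused_columns=False), and torch.tensor cannot build a tensor from strings. A minimal standalone illustration of that underlying failure (not trl code):

import torch

# torch.tensor refuses to build a tensor out of Python strings, which is what
# the collator effectively ends up asking for once the raw "text" column
# reaches tokenizer.pad(..., return_tensors="pt").
torch.tensor(["This is a raw IMDB review, not a list of token ids."])
# ValueError: too many dimensions 'str'

Assuming that is the regression (the raw string column surviving into the collator in 0.7.8+), dropping remove_unused_columns=False from TrainingArguments, or passing an already-tokenized dataset, may serve as a temporary workaround; neither is a confirmed fix.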
About this issue
- State: closed
- Created 6 months ago
- Comments: 16
I see, ok! If you want, you can build from that branch:
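(The branch name is not quoted in this excerpt. In general, a trl build from a specific branch can be installed straight from git; <branch-name> below is a placeholder, not the actual branch from the thread:)

pip install -U "git+https://github.com/huggingface/trl.git@<branch-name>"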