trl: Failed to load data in trl 0.7.8/0.7.9.

This is a new regression introduced in trl 0.7.8 (and still present in 0.7.9); 0.7.7 is fine.

We run into ValueError: too many dimensions 'str' when loading data into the trainer. Here's a simple Llama-2 + LoRA fine-tuning on the IMDB dataset as a minimal repro:

#!/usr/bin/env python3

import datasets
import peft
import transformers
import trl


model_dir = "models/Llama-2-7b-hf"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = transformers.AutoModelForCausalLM.from_pretrained(model_dir)

ds_train = datasets.load_dataset("imdb", split="train[:10]")

trainer = trl.SFTTrainer(
    model=model,
    args=transformers.TrainingArguments(
        output_dir="output",
        max_steps=1,
        remove_unused_columns=False,
    ),
    peft_config=peft.LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="Causal_LM",
    ),
    train_dataset=ds_train,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=8,
)
trainer.train()

0.7.7 works:

# CUDA_VISIBLE_DEVICES=0 ./test.py 
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (4.0.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.52s/it]
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
  0%|                                                                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.24it/s]Attempted to log scalar metric train_runtime:
0.8097
Attempted to log scalar metric train_samples_per_second:
9.88
Attempted to log scalar metric train_steps_per_second:
1.235
Attempted to log scalar metric total_flos:
2538830561280.0
Attempted to log scalar metric train_loss:
4.124451637268066
Attempted to log scalar metric epoch:
0.5
{'train_runtime': 0.8097, 'train_samples_per_second': 9.88, 'train_steps_per_second': 1.235, 'train_loss': 4.124451637268066, 'epoch': 0.5}                                                                                                                 
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.24it/s]

0.7.8 fails:

# CUDA_VISIBLE_DEVICES=0 ./test.py 
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (4.0.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.54s/it]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 940.17 examples/s]
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
  0%|                                                                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 748, in convert_to_tensors
    tensor = as_tensor(value)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 720, in as_tensor
    return torch.tensor(value)
ValueError: too many dimensions 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./test.py", line 38, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 317, in train
    output = super().train(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/transformers/data/data_collator.py", line 45, in __call__
    return self.torch_call(features)
  File "/usr/local/lib/python3.10/dist-packages/transformers/data/data_collator.py", line 732, in torch_call
    batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3299, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 223, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 764, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`text` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
  0%|          | 0/1 [00:00<?, ?it/s]
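
From the traceback, the data collator ends up with the raw text strings (the final error even names the `text` feature) and torch.tensor cannot turn strings into a tensor. Until a fixed release is out, one thing worth trying, assuming the kept `text` column is what trips the collator, is to drop remove_unused_columns=False so the trainer falls back to its default of stripping columns the model doesn't accept. This is a sketch of that workaround, not a confirmed fix:

import transformers

# Same TrainingArguments as in the repro, but without remove_unused_columns=False;
# the default (True) lets the Trainer drop the raw "text" column before batches
# reach the data collator.
args = transformers.TrainingArguments(
    output_dir="output",
    max_steps=1,
)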

About this issue

  • State: closed
  • Created 6 months ago
  • Comments: 16

Most upvoted comments

I see, ok! If you want, you can build from that branch:

pip install -U git+https://github.com/huggingface/trl.git@fix-breaking-change
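
After installing from the branch, a quick sanity check (not part of the original report) is to confirm which trl install is actually being imported before rerunning the repro:

import trl
print(trl.__version__, trl.__file__)  # confirm the branch install is the one being picked up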