transformers: bug in transformers notebook (training from scratch)?

Hello there!

First of all, I cannot thank @Rocketknight1 enough for the amazing work he has been doing to create TensorFlow versions of the notebooks. On my side, I have spent some time and money (Colab Pro) trying to tie the notebooks together into a full classifier built from scratch, with the following steps:

  1. train the tokenizer
  2. train the language model
  3. train the classification head.

Unfortunately, I ran into two issues. You can reproduce them with the full notebook pasted below.

First issue: by training my own tokenizer, I actually get a perplexity (225) that is way worse than in the example shown at https://github.com/huggingface/notebooks/blob/new_tf_notebooks/examples/language_modeling-tf.ipynb, which uses

model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

This is puzzling, as my tokenizer is trained on the very data used in the original tf2 notebook!
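
For reference, here is a quick sanity check one could run after building mytokenizer below (a hypothetical snippet, not part of either notebook) to compare its vocabulary with the one the pretrained checkpoint expects:

from transformers import AutoConfig

# hypothetical check: the WordPiece trainer below uses vocab_size=25000,
# while bert-base-uncased ships with a 30522-token vocabulary
config = AutoConfig.from_pretrained("bert-base-uncased")
print("my tokenizer vocab size:", mytokenizer.vocab_size)
print("checkpoint vocab size:", config.vocab_size)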

Second, there seems to be a Python issue when I try to fine-tune the language model I obtained above with a text classification head.

Granted, the tokenizer and the underlying language model have been trained on a different dataset than the classification data (the wikitext dataset from the previous two tf2 notebooks, that is). See https://github.com/huggingface/notebooks/blob/new_tf_notebooks/examples/text_classification-tf.ipynb . However, I should at least get some valid output! Here the model is complaining about some collate function.

Could you please have a look @sgugger @LysandreJik @Rocketknight1 when you can? I would be very happy to contribute this notebook to the Hugging Face community (although most of the credit goes to @Rocketknight1). There is great demand for building language models and NLP tasks from scratch.

Thanks!!!

Code below

get the most recent versions

!pip install git+https://github.com/huggingface/datasets.git
!pip install transformers

train tokenizer from scratch


from datasets import load_dataset
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
batch_size = 1000

def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

all_texts = [dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size)]
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)
tokenizer.decoder = decoders.WordPiece(prefix="##")

from transformers import BertTokenizerFast
mytokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

masked language model from scratch, using my own tokenizer mytokenizer


model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize_function(examples):
    return mytokenizer(examples["text"], truncation=True)

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead if the model supported it.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

from transformers import create_optimizer, AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)


# The TF masked-LM model computes its own loss and returns it under the "loss"
# output key, so this Keras loss just passes that value through.
def dummy_loss(y_true, y_pred):
    return tf.reduce_mean(y_pred)


model.compile(optimizer=optimizer, loss={"loss": dummy_loss})

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=mytokenizer, mlm_probability=0.15, return_tensors="tf"
)

train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

validation_set = lm_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
model.fit(train_set, validation_data=validation_set, epochs=1)
import math

eval_results = model.evaluate(validation_set)[0]
print(f"Perplexity: {math.exp(eval_results):.2f}")

and fine-tune on a classification task

GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]


task = "sst2"
batch_size = 16

from datasets import load_dataset, load_metric

actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)

and now try to classify text

from transformers import AutoTokenizer

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

def preprocess_function(examples):
    if sentence2_key is None:
        return mytokenizer(examples[sentence1_key], truncation=True)
    return mytokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

pre_tokenizer_columns = set(dataset["train"].features)
encoded_dataset = dataset.map(preprocess_function, batched=True)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)

validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)

tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=mytokenizer.pad,
)
tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=False,
    batch_size=16,
    collate_fn=mytokenizer.pad,
)

from transformers import TFAutoModelForSequenceClassification
import numpy as np
import tensorflow as tf

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2

if task == "stsb":
    loss = tf.keras.losses.MeanSquaredError()
    num_labels = 1
elif task.startswith("mnli"):
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 3
else:
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 2

model = TFAutoModelForSequenceClassification.from_pretrained(
    model, num_labels=num_labels
)

from transformers import create_optimizer

num_epochs = 5

batches_per_epoch = len(encoded_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)

optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)
model.compile(optimizer=optimizer, loss=loss)

metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)

def compute_metrics(predictions, labels):
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)

model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=2)],
)

predictions = model.predict(tf_validation_dataset)["logits"]
compute_metrics(predictions, np.array(encoded_dataset[validation_key]["label"]))

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d01ad7112f932f9c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-de5efda680a1f856.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-0f3c1e00b7f03ba8.arrow
Sentence: hide new secretions from the parental units 
Columns added by tokenizer: ['attention_mask', 'input_ids', 'token_type_ids']
---------------------------------------------------------------------------
VisibleDeprecationWarning                 Traceback (most recent call last)
<ipython-input-42-6eba4122302c> in <module>()
     44     shuffle=True,
     45     batch_size=16,
---> 46     collate_fn=mytokenizer.pad,
     47 )
     48 tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(

9 frames
/usr/local/lib/python3.7/dist-packages/datasets/formatting/formatting.py in _arrow_array_to_numpy(self, pa_array)
    165             # cast to list of arrays or we end up with a np.array with dtype object
    166             array: List[np.ndarray] = pa_array.to_numpy(zero_copy_only=zero_copy_only).tolist()
--> 167         return np.array(array, copy=False, **self.np_array_kwargs)
    168 
    169 

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray

What do you think? Happy to help if I can. Thanks!!

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 45 (21 by maintainers)

Most upvoted comments

Hi @randomgambit, sorry for the lengthy delay in replying again! I’m still making changes to some of the lower-level parts of the library, so these notebooks haven’t been fully finalized yet.

The VisibleDeprecationWarning in your first post is something that will hopefully be fixed by upcoming changes to datasets, but for now you can just ignore it.

The error you’re getting in your final post is, I think, caused by you overwriting the variable model in your code. The from_pretrained() method expects a string like bert-base-cased, but it seems like you’ve created an actual TF model with that variable name. If you pass an actual model object to from_pretrained() it’ll get very confused - so make sure that whatever argument you’re passing there is a string and not something else!
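
A minimal sketch of the intended call (assuming the same bert-base-uncased checkpoint string and the num_labels variable from the notebook above, and renaming the result to avoid clobbering anything):

from transformers import TFAutoModelForSequenceClassification

# pass a checkpoint name or path (a string), not a previously built model object
clf_model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_labels
)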

I’m sure @Rocketknight1 will know what’s going on here 😃

Hey all - I’m going to merge the PR with the fix so that it can be included in the next release of transformers this week. However, if you have further problems, please reopen the issue and let me know!

Hi all, I made a bunch of edits and hopefully things should work more smoothly now! Let me know if the problems remain.

Have you found any solution @randomgambit? Running into this myself.

For the first issue, you are training a new model from scratch versus fine-tuning one that has been pretrained on way more data. It's completely normal that the latter wins. As for the second one, I'm not sure you can directly use the tokenizer.pad method as a collation function.

Note that when you copy error messages like this, you should expand the intermediate frames so we can see where the error comes from.
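
For the collation point above, one possible alternative (a sketch, not verified against the notebook in this issue) is to wrap the tokenizer in DataCollatorWithPadding instead of passing tokenizer.pad directly:

from transformers import DataCollatorWithPadding

# pads each batch dynamically and returns TF tensors
collator = DataCollatorWithPadding(tokenizer=mytokenizer, return_tensors="tf")

tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=collator,
)

And if the goal is a genuinely from-scratch model matched to the new tokenizer (rather than re-using the bert-base-uncased weights), one way to build it might be from a fresh config:

from transformers import BertConfig, TFAutoModelForMaskedLM

# assumes mytokenizer from the tokenizer-training step; all weights are randomly initialized
config = BertConfig(vocab_size=mytokenizer.vocab_size)
model = TFAutoModelForMaskedLM.from_config(config)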