transformers: bug in transformers notebook (training from scratch)?
Hello there!
First of all, I cannot thank @Rocketknight1 enough for the amazing work he has been doing to create TensorFlow versions of the notebooks. On my side, I have spent some time and money (Colab Pro) trying to tie the notebooks together into a full classifier built from scratch with the following steps:
- train the tokenizer
- train the language model
- train the classification head.
Unfortunately, I ran into two issues. You can reproduce them with the full notebook pasted below.
First issue: by training my own tokenizer I actually get a perplexity (225) that is far worse than in the example shown at https://github.com/huggingface/notebooks/blob/new_tf_notebooks/examples/language_modeling-tf.ipynb, which uses
model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
This is puzzling, as the tokenizer should be fine-tuned to the very data used in the original tf2 notebook!
Second, there seems to be a Python issue when I try to fine-tune the language model I obtained above with a text classification head. Granted, the tokenizer and the underlying language model have been trained on another dataset (the wikitext dataset from the previous two tf2 notebooks, that is). See https://github.com/huggingface/notebooks/blob/new_tf_notebooks/examples/text_classification-tf.ipynb . However, I should at least get some valid output! Here the model is complaining about some collate function.
Could you please have a look @sgugger @LysandreJik @Rocketknight1 when you can? I would be very happy to contribute this notebook to the Hugging Face community (although most of the credit goes to @Rocketknight1). There is great demand for building language models and NLP tasks from scratch.
Thanks!!!
Code below
get the most recent versions
!pip install git+https://github.com/huggingface/datasets.git
!pip install transformers
train tokenizer from scratch
from datasets import load_dataset
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
batch_size = 1000
def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]
all_texts = [dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size)]
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)
tokenizer.decoder = decoders.WordPiece(prefix="##")
from transformers import BertTokenizerFast
mytokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
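A quick sanity check of the wrapped tokenizer (the exact sub-word splits will depend on the vocabulary you just trained, so the sentence here is only illustrative):
# quick sanity check: encode one sentence with the freshly trained tokenizer
encoding = mytokenizer("Hello, this is a test sentence.")
print(mytokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # should start with [CLS] and end with [SEP]
print(mytokenizer.decode(encoding["input_ids"]))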
masked language model from scratch using my own tokenizer mytokenizer
model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
def tokenize_function(examples):
    return mytokenizer(examples["text"], truncation=True)
tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
block_size = 128
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported it.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
from transformers import create_optimizer, AdamWeightDecay
import tensorflow as tf
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
def dummy_loss(y_true, y_pred):
    return tf.reduce_mean(y_pred)
model.compile(optimizer=optimizer, loss={"loss": dummy_loss})
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=mytokenizer, mlm_probability=0.15, return_tensors="tf"
)
train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
validation_set = lm_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
model.fit(train_set, validation_data=validation_set, epochs=1)
import math
eval_results = model.evaluate(validation_set)[0]
print(f"Perplexity: {math.exp(eval_results):.2f}")
and fine-tune on a classification task
GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]
task = "sst2"
batch_size = 16
from datasets import load_dataset, load_metric
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)
and now try to classify text
from transformers import AutoTokenizer
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")
def preprocess_function(examples):
    if sentence2_key is None:
        return mytokenizer(examples[sentence1_key], truncation=True)
    return mytokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
pre_tokenizer_columns = set(dataset["train"].features)
encoded_dataset = dataset.map(preprocess_function, batched=True)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)
validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)
tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=mytokenizer.pad,
)
tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=False,
    batch_size=16,
    collate_fn=mytokenizer.pad,
)
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf
num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
if task == "stsb":
    loss = tf.keras.losses.MeanSquaredError()
    num_labels = 1
elif task.startswith("mnli"):
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 3
else:
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 2
model = TFAutoModelForSequenceClassification.from_pretrained(
    model, num_labels=num_labels
)
from transformers import create_optimizer
num_epochs = 5
batches_per_epoch = len(encoded_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)
model.compile(optimizer=optimizer, loss=loss)
metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)
import numpy as np
def compute_metrics(predictions, labels):
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=5,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=2)],
)
predictions = model.predict(tf_validation_dataset)["logits"]
compute_metrics(predictions, np.array(encoded_dataset[validation_key]["label"]))
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d01ad7112f932f9c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-de5efda680a1f856.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-0f3c1e00b7f03ba8.arrow
Sentence: hide new secretions from the parental units
Columns added by tokenizer: ['attention_mask', 'input_ids', 'token_type_ids']
---------------------------------------------------------------------------
VisibleDeprecationWarning Traceback (most recent call last)
<ipython-input-42-6eba4122302c> in <module>()
44 shuffle=True,
45 batch_size=16,
---> 46 collate_fn=mytokenizer.pad,
47 )
48 tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
9 frames
/usr/local/lib/python3.7/dist-packages/datasets/formatting/formatting.py in _arrow_array_to_numpy(self, pa_array)
165 # cast to list of arrays or we end up with a np.array with dtype object
166 array: List[np.ndarray] = pa_array.to_numpy(zero_copy_only=zero_copy_only).tolist()
--> 167 return np.array(array, copy=False, **self.np_array_kwargs)
168
169
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
What do you think? Happy to help if I can. Thanks!!
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 45 (21 by maintainers)
Hi @randomgambit, sorry for the lengthy delay in replying again! I’m still making changes to some of the lower-level parts of the library, so these notebooks haven’t been fully finalized yet.
The VisibleDeprecationWarning in your first post is something that will hopefully be fixed by upcoming changes to datasets, but for now you can just ignore it. The error you’re getting in your final post is, I think, caused by you overwriting the variable model in your code. The from_pretrained() method expects a string like bert-base-cased, but it seems like you’ve created an actual TF model with that variable name. If you pass an actual model object to from_pretrained() it’ll get very confused - so make sure that whatever argument you’re passing there is a string and not something else!
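For illustration, a minimal sketch of that pattern, assuming you first save the masked-LM backbone and tokenizer trained above to a local directory (the directory name and the clf_model variable are placeholders, not part of the original notebook):
# save the trained MLM backbone and tokenizer to a local directory (placeholder path)
model.save_pretrained("my-wikitext-mlm")
mytokenizer.save_pretrained("my-wikitext-mlm")
# reload for classification from that path (a string), using a new variable name
# so the checkpoint string is not overwritten by the model object
from transformers import TFAutoModelForSequenceClassification
clf_model = TFAutoModelForSequenceClassification.from_pretrained(
    "my-wikitext-mlm", num_labels=num_labels
)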
I’m sure @Rocketknight1 will know what’s going on here 😃
Hey all - I’m going to merge the PR with the fix so that it can be included in the next release of transformers this week. However, if you have further problems, please reopen the issue and let me know!
Hi all, I made a bunch of edits and hopefully things should work more smoothly now! Let me know if the problems remain.
Have you found any solution @randomgambit? Running into this myself.
For the first issue: you are training a new model from scratch versus fine-tuning one that has been pretrained on way more data. It’s completely normal that the latter wins. As for the second one, I’m not sure you can directly use the tokenizer.pad method as a collation function.
Note that since you are copying the error messages, you should expand the intermediate frames so we can see where the error comes from.
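If tokenizer.pad does turn out to be the culprit, one alternative to try is DataCollatorWithPadding with TF tensors, sketched below (untested against this exact notebook; clf_data_collator is just a placeholder name):
from transformers import DataCollatorWithPadding
# pad each batch dynamically and return TF tensors instead of passing mytokenizer.pad directly
clf_data_collator = DataCollatorWithPadding(tokenizer=mytokenizer, return_tensors="tf")
tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=clf_data_collator,
)
tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=False,
    batch_size=16,
    collate_fn=clf_data_collator,
)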