Fine-tune a pretrained model on your own data

In lesson 1 you met the asymmetry: pre-training is the expensive part, fine-tuning is the cheap part. This is the lesson where you do the cheap part. You take a model someone else paid to pre-train, continue training it for a few minutes on a dataset of your own, and watch it learn a task it could not do before. The transformers library gives you a Trainer class that handles the hard machinery (the training loop, batching, gradient steps, mixed precision) so the part you write is small.

Keep a notebook open, and this time you want a GPU. Training on a CPU works but is painfully slow; Google Colab gives you a free GPU, which is the path of least resistance. If you have not already done so, install the transformers, datasets, and evaluate libraries.

The setup: data ready to train on

Fine-tuning needs a labeled dataset. We will use MRPC, a small dataset of sentence pairs labeled as paraphrases or not, which is part of the GLUE benchmark. Loading and tokenizing it is the work from the previous lessons, recapped here:

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

One piece here is new and worth a pause: the data collator. In the last lesson we padded every sentence to the same length up front. That wastes compute, because the longest example in the whole dataset forces every batch to be that long. The padding data collator does dynamic padding instead: it pads each batch only to the length of the longest example in that batch, right as the batch is assembled. Shorter batches stay short. You hand it to the Trainer and forget about it.
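
To put a rough number on the saving, here is a toy sketch that counts how many token slots each padding strategy produces. It does not use the library at all: the lengths are made-up values standing in for tokenized examples, and the two helper functions are hypothetical names introduced only for this illustration.

```python
# Made-up token counts standing in for eight tokenized examples.
lengths = [12, 48, 9, 15, 11, 14, 10, 13]
batch_size = 4

def padded_tokens_fixed(lengths):
    """Pad every example to the longest example in the whole dataset."""
    return max(lengths) * len(lengths)

def padded_tokens_dynamic(lengths, batch_size):
    """Pad each batch only to the longest example inside that batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

fixed = padded_tokens_fixed(lengths)
dynamic = padded_tokens_dynamic(lengths, batch_size)
print(fixed, dynamic)  # 384 248
```

The one long example (48 tokens) inflates only the batch it lands in; the second batch is padded to 14 instead of 48.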

The model: a head swap you should expect

Load the model the way you did in lesson 2, but with one new argument that sets the number of labels, because MRPC has two classes (paraphrase or not).

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Run this and you will get a warning. It is not a problem; it is the whole point. BERT was pre-trained on masked language modeling and next-sentence prediction, not on deciding whether two sentences are paraphrases, so its pretraining head is the wrong shape for this job. The library discards that head and bolts on a fresh sequence-classification head with random weights. The warning tells you exactly that: some pretrained weights were dropped, some new weights are random, and you should train the model to make the new head useful. That is precisely what comes next. Seeing this warning means the setup worked.
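
One way to picture what the warning reports is as a comparison between two sets of parameter names: the ones stored in the checkpoint and the ones the new model expects. The sketch below is a conceptual illustration only, not the library's loading code, and the key names are an abbreviated, illustrative subset modeled on what the warning typically lists.

```python
# Abbreviated, illustrative parameter names (not a real checkpoint's full list).
checkpoint_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "cls.predictions.bias",         # pretraining head: masked language modeling
    "cls.seq_relationship.weight",  # pretraining head: next-sentence prediction
}
model_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "classifier.weight",            # new sequence-classification head
    "classifier.bias",
}

loaded = checkpoint_keys & model_keys             # copied from the checkpoint
dropped = checkpoint_keys - model_keys            # in the checkpoint, unused by the model
newly_initialized = model_keys - checkpoint_keys  # in the model, so randomly initialized

print(sorted(dropped))
print(sorted(newly_initialized))
```

The body of the network lands in the first set and keeps its pretrained weights; only the head changes hands.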

TrainingArguments: the configuration object

Everything about how to train (where to save, learning rate, batch size, how often to evaluate, which speed tricks to use) lives in a single TrainingArguments object. The only argument you must supply is an output directory; the defaults handle the rest for a basic run.

from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

That is enough to train. You will tune this object far more than any other piece, so it is worth knowing it is the one knob-box for the whole run.
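
To make "the defaults handle the rest" concrete, the arithmetic below estimates how many optimizer steps a default run takes on a single device. The three values are assumptions: 3,668 is taken as the size of the MRPC training split, and 8 and 3 are assumed to be the default per-device batch size and epoch count in recent transformers releases. Check them against your installed version.

```python
import math

# Assumptions (verify against your dataset and installed version):
num_train_examples = 3668        # assumed size of the MRPC training split
per_device_train_batch_size = 8  # assumed TrainingArguments default
num_train_epochs = 3             # assumed TrainingArguments default

# The last batch of each epoch may be smaller, so round up.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 459 1377
```

If the progress bar in your run shows a different total, one of these assumptions does not hold for your setup (for example, several GPUs or a different batch size).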

Trainer: assemble and go

The Trainer takes the model, the arguments, the train and validation splits, the collator, and the tokenizer (passed in to tell the Trainer how to process data, via the processing-class argument). Then one method call starts training.

from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
)

trainer.train()

Calling the train method runs the fine-tuning (a couple of minutes on a GPU) and reports the training loss as it goes. When you pass a tokenizer in, the Trainer already defaults its collator to the padding data collator, so the explicit collator line is optional; we keep it to make the moving part visible.

Evaluation: did it actually get better?

Here is the catch: training on its own tells you the training loss is going down, but not whether the model is any good on data it has not seen. Loss going down is necessary, not sufficient. To get a real answer you add two things: tell the Trainer to evaluate, by setting an evaluation strategy, and give it a function that computes a metric you understand.

First, see what the model predicts on the validation set:

predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)

Those raw predictions are logits again (every Transformer model outputs logits, as you saw in lesson 2): 408 examples, with one logit per label, so 2 per example. To turn logits into a concrete prediction, take the index of the larger value in each row (the argmax operation):

import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
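
If the argmax step feels abstract, here it is on three hand-written rows. The numbers are made up, and the variable names are chosen so they do not overwrite anything from the real run. The softmax lines are an optional extra showing that converting logits to probabilities does not change which label wins.

```python
import numpy as np

# Three made-up rows of logits: one row per example, one column per label.
toy_logits = np.array([
    [2.0, -1.0],
    [0.3, 1.2],
    [-0.5, -0.1],
])

# Index of the largest value in each row is the predicted label.
toy_preds = np.argmax(toy_logits, axis=-1)

# Softmax, with the row maximum subtracted for numerical stability.
shifted = toy_logits - toy_logits.max(axis=-1, keepdims=True)
toy_probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(toy_preds.tolist())                   # [0, 1, 1]
print(toy_probs.argmax(axis=-1).tolist())   # [0, 1, 1]
```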

Now compare those predictions to the true labels using the evaluate library, which knows the standard metrics for named datasets:

import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8578, 'f1': 0.8997}

That is roughly 86% accuracy and an F1 of 0.90, the two standard MRPC metrics, from a model that started this lesson unable to do the task at all. Your exact numbers will differ a little from run to run, because the new head starts from random weights. Wrap that logic into the function the Trainer expects, which receives the predictions and labels and returns a dictionary of named metrics:

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
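
For intuition about the two numbers such a function returns, here is a hand computation on eight made-up predictions. It is a sketch, not the evaluate library's implementation: the helper name is hypothetical, and it assumes label 1 is the positive (paraphrase) class and that F1 is computed for that class only.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 for the positive class by hand."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

# Made-up values: 6 of 8 correct, with one false positive and one false negative.
example_preds = [1, 1, 0, 1, 0, 1, 1, 0]
example_labels = [1, 0, 0, 1, 1, 1, 1, 0]
example_scores = accuracy_and_f1(example_preds, example_labels)
print(example_scores)
```

Accuracy counts every correct answer equally; F1 looks only at how well the positive class is found, which is why the two numbers can diverge when the classes are imbalanced.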

Then rebuild the Trainer with the epoch evaluation strategy and the new function, so it reports real metrics at the end of every pass through the data. Use a fresh TrainingArguments and a fresh model; otherwise you are just continuing to train the model you already trained:

training_args = TrainingArguments("test-trainer", eval_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()

Now each epoch reports validation loss and your metrics alongside the training loss. That is the full loop: prepare data, load a model with the right head, configure, train, and measure.

A few switches worth knowing

The Trainer packages modern training practices behind single arguments in the TrainingArguments object. Three you will reach for:

The fp16 flag turns on mixed-precision training: on a supported GPU it is usually faster and lighter on memory, with little or no loss in accuracy.
The gradient-accumulation argument simulates a larger batch size when your GPU cannot fit one, by accumulating gradients across several small batches before stepping.
The learning-rate and scheduler arguments set the most important hyperparameter and how it changes over the run.

You do not need these for a first run, but knowing they are one keyword away is what makes the Trainer scale from a toy example to a serious training job.
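
Gradient accumulation is the least obvious of the three, so here is a small sketch of why it works. It uses a one-parameter linear model with a mean squared error loss and made-up data, none of which comes from this lesson, and checks that averaging the gradients of four micro-batches of 2 gives the same gradient as one batch of 8.

```python
# Made-up data for a one-parameter model y_hat = w * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Gradient from one large batch of 8.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient from 4 micro-batches of 2, accumulated before a single step.
micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size
accumulated_grad = 0.0
for i in range(0, len(xs), micro_batch_size):
    micro_grad = mse_grad(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
    accumulated_grad += micro_grad / accumulation_steps  # average over micro-batches

print(full_batch_grad, accumulated_grad)
```

In TrainingArguments the corresponding argument is gradient_accumulation_steps; the effective batch size is the per-device batch size times that value times the number of devices.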

Why this matters when you use AI

Fine-tuning is the step that turns a generic model into your model. Most applied AI work is not pre-training a model from scratch (you will almost never do that) and is often not even prompting alone; it is taking a strong base model and shaping it to a specific task with a modest labeled dataset. The Trainer is what makes that accessible: the parts that are genuinely hard (a correct training loop, evaluation plumbing, mixed precision, multi-GPU) are handled, and the parts you own (which data, which metric, which hyperparameters) are the parts that actually encode your problem. And the discipline of computing metrics matters beyond this lesson: a falling training loss is not proof of a good model, a metric on held-out data is. Carry that habit into every model you train.

What you should remember

Fine-tuning continues training a pretrained model on your task-specific data. It is the cheap part from lesson 1, made concrete; you will do it far more often than you pre-train.
A data collator handles dynamic padding: the padding data collator pads each batch to its own longest example instead of padding the whole dataset to one length.
Loading a model with a task head triggers an expected warning: the pretraining head is discarded and a fresh, randomly initialized head is added. That warning means the setup is correct; training makes the new head useful.
TrainingArguments is the single config object for the whole run; only an output directory is required, defaults handle the rest.
Trainer assembles model, args, datasets, collator, and tokenizer; a single call to its train method runs the loop.
A falling training loss is not enough. Set an evaluation strategy and a metrics function: turn logits into predictions (the argmax step), then score them with the evaluate library, to measure quality on held-out data.

Pre-training builds a model that knows language; fine-tuning, in a few minutes and a few lines, makes it know your task.