Practice: What transformers do

Self-check

Seven short questions. Answer each in your head (or on paper) before opening the collapsible. Pulling the answer out of memory is where the learning sticks; rereading feels productive but does much less.

1. In one sentence, what does a transformer do, regardless of which product it powers?

Show answer

Tokens in, tokens out, with attention layers in the middle. A sequence of tokens goes in, the model runs them through stacked attention and feed-forward layers, and a sequence comes out. What the output represents (a reply, a translation, a classification, an embedding) depends on what the model was trained to produce, not on the architecture, which is the same family across all of them.

2. Why did transformers replace RNNs and LSTMs? Name both reasons.

Show answer

First, long-range connections. A sequential model carries a running summary that gets compressed at every step, so over a long sequence the signal from the earliest words fades. A transformer lets every token attend directly to every other token within a single layer, so a long-range link is a direct lookup, not a decaying memory. Second, parallelism. Sequential models must process word nine before word ten; a transformer processes all positions in a layer at once, which matches how modern hardware multiplies large matrices. The win was practical (faster training that scales), not a more elegant idea.
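To make the "direct lookup" idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not how any library implements it: the function names and the toy vectors are made up, there is a single head, and the learned projection matrices a real model applies to get queries, keys, and values are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    # Each position's result depends only on the inputs, never on another
    # position's result, so real implementations run this loop as one
    # batched matrix multiplication.
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)  # how much this position looks at each other position
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Toy 2-dimensional vectors for a three-token sequence (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attend(tokens, tokens, tokens)
```

Notice that the first token reaches the last one through a single weighted sum, with no chain of intermediate steps in between. That is the contrast with a running summary that has to survive every step along the way.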

3. Name the three architectural shapes, what each is built for, and a canonical example of each.

Show answer

Encoder-only for understanding (classification, named-entity recognition, embeddings, search); the whole input is visible to every layer. Canonical: the BERT family. Decoder-only for generation; it produces one token at a time, each conditioned only on what came before. Canonical: the GPT family, plus Llama and Mistral. Encoder-decoder for sequence-to-sequence tasks where the output is a related-but-different sequence (translation, summarization); an encoder reads the input fully and a decoder generates while attending back into it. Canonical: T5 and BART.
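One way to picture the difference between the shapes is the attention mask: which positions each position is allowed to look at. The sketch below is a toy illustration with hypothetical helper names (`attention_mask`, `show`), not code from any library. An encoder uses the full grid, and a decoder uses the lower triangle. In an encoder-decoder, the decoder side combines the triangular mask over its own output with full visibility of the encoder's result.

```python
def attention_mask(length, causal):
    """Return a length x length grid; True means the row position may attend to the column position."""
    return [[(not causal) or col <= row for col in range(length)]
            for row in range(length)]

def show(mask):
    """Render a mask as text: 'x' for visible, '.' for hidden."""
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)

encoder_mask = attention_mask(4, causal=False)  # every token sees the whole input
decoder_mask = attention_mask(4, causal=True)   # each token sees only itself and earlier tokens

print(show(encoder_mask))
# xxxx
# xxxx
# xxxx
# xxxx
print(show(decoder_mask))
# x...
# xx..
# xxx.
# xxxx
```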

4. You need to build semantic search over a company’s document store, and separately to flag support tickets as urgent or not. Which shape fits both, and why?

Show answer

Encoder-only. Both jobs are understanding tasks: search needs a rich representation (an embedding) of each document so similar meanings sit close together, and ticket triage is classification of an input you can see in full. Neither job generates new text token by token, so you do not need a decoder. The first sorting question for any real task is which of the three shapes fits, and “understand this input” points at the encoder.
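The search half of this answer can be sketched in a few lines once the embeddings exist. In the example below the document names and the 3-dimensional vectors are invented for illustration; in a real system an encoder-only model would produce them, with hundreds of dimensions. Cosine similarity is one common way to measure "close together", assumed here and not prescribed by the lesson.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings; a real encoder would compute these from the text.
doc_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "holiday schedule": [0.0, 0.2, 0.9],
    "returning an item": [0.8, 0.3, 0.1],
}
query_vector = [0.9, 0.0, 0.1]  # stands in for the embedded query "how do I get my money back"

# Rank documents from most to least similar to the query.
ranked = sorted(doc_vectors,
                key=lambda name: cosine(query_vector, doc_vectors[name]),
                reverse=True)
print(ranked)
# ['refund policy', 'returning an item', 'holiday schedule']
```

The search step itself never generates text. It only compares representations, which is why no decoder is needed.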

5. Of pre-training and fine-tuning, which is the expensive part, and which one are you actually likely to do?

Show answer

Pre-training is the expensive part: an enormous corpus, weeks to months on large clusters, budgets that make headlines. It produces a broadly capable base model. Fine-tuning is the cheap part: continued training on a small task-specific dataset, often hours on a single machine, to shape a base model into a chat assistant or a classifier. In practice you will almost never pre-train, sometimes fine-tune, and most often just load a pre-trained model and use it. The Hugging Face ecosystem is built around that last case.

6. Name two of the limits the lesson calls out, and explain why they are properties of the technology rather than bugs awaiting a patch.

Show answer

Any two of: bias passes through (a model reflects its training data, so English-heavy data means English-heavy strengths and slants), hallucination (a generative model produces fluent text with no reliable internal signal separating right from wrong), finite context (behavior on very long inputs is often worse, even within the advertised window), and reasoning is more pattern recognition than deduction (a convincing chain can still reach a wrong answer). They are properties because they fall out of how the model is built and trained: it learns patterns from data and predicts plausible continuations. You manage them with good habits; you do not wait for a release that removes them.

7. Hugging Face is two things working together. What are they?

Show answer

A platform (the Hub at huggingface.co) that hosts models, datasets, and small apps called Spaces, with model cards describing training data, intended use, limits, and license. And a set of open-source Python libraries that make those models usable: transformers (load models and tokenizers with a uniform API), datasets (the data side), tokenizers (fast tokenization), and accelerate (run the same training code on a laptop CPU or a multi-GPU cluster without rewriting). The Hub, plus the libraries that make it usable, is the whole reason this track exists.
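As a rough sketch of how the two halves fit together, the function below asks the transformers library for a model hosted on the Hub. Treat it as an illustration under stated assumptions: the wrapper name `load_sentiment_classifier` is hypothetical, the model ID is one publicly hosted sentiment classifier chosen as an example, and running it needs the library installed (with a backend such as PyTorch) plus network access for the first download.

```python
def load_sentiment_classifier(
    model_name="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
):
    """Fetch a fine-tuned model from the Hub and wrap it in a ready-to-call classifier."""
    # Imported inside the function so this file can be read and imported
    # even on a machine where the library is not installed yet.
    from transformers import pipeline

    # pipeline() downloads the model and its tokenizer on first use,
    # then caches them locally for later calls.
    return pipeline("text-classification", model=model_name)

# Usage (assumes `pip install transformers torch` and network access):
#   classifier = load_sentiment_classifier()
#   classifier("The install guide was clear and everything worked.")
```

The model named here is an encoder-only classifier that someone else already pre-trained and fine-tuned, so the code only loads and uses it.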

Try it yourself: sort the task to a shape

About 10 minutes, no code. The single most useful habit from this lesson is asking “which of the three shapes does this task want?” before reaching for a model. Practice it.

Part A: sort six tasks. For each task below, decide whether it most naturally wants an encoder-only, decoder-only, or encoder-decoder model. Write your pick before revealing.

a. Translate English news articles into French.
b. Autocomplete a developer’s code as they type.
c. Flag incoming support emails as urgent or not urgent.
d. Power a chat assistant that writes replies to users.
e. Build a search index where similar-meaning documents sit close together.
f. Turn a long report into a short abstract.

What you’ll get

a. Translate → encoder-decoder. One sequence in, a related-but-different sequence out. The classic sequence-to-sequence job (T5, BART).
b. Autocomplete code → decoder-only. Pure left-to-right generation, each token conditioned on what came before.
c. Flag emails → encoder-only. Classification of an input you can see in full. No new text generated.
d. Chat assistant → decoder-only. Generation again; most chat assistants you have used are decoder-only under the hood.
e. Semantic search index → encoder-only. You want an embedding (a rich representation) of each document, which is an understanding task.
f. Summarize a long report → encoder-decoder is the classic shape (input sequence to shorter output sequence), and it is the cleanest answer here. Honest nuance: large decoder-only models now do summarization well too, by treating “here is the report, write a summary” as a generation task. The shapes are a sorting tool, not a cage.

If you got four or more right, the sorting instinct is forming. That instinct alone will save you from a lot of bad model picks.

Part B (reasoning). A teammate proposes using an encoder-only BERT model to power a chatbot that writes replies. Why is that the wrong shape, and what would you reach for instead?

What you should notice

An encoder-only model builds a representation of an input it can see in full; it is not built to generate text one token at a time. A chatbot’s whole job is left-to-right generation, so you want a decoder-only model. This is exactly the failure the lesson warns about: many bad results come from reaching for the wrong shape. The fix is not a bigger encoder; it is the right shape for the job.

Flashcards

Ten cards.

Q. What does a transformer do, in one sentence?

Tokens in, tokens out, with attention layers in the middle. The architecture is the same family across chat, translation, classification, and embeddings; the training objective decides what the output means.

Q. Why did transformers replace RNNs and LSTMs?

Two reasons. Long-range connections become direct lookups instead of decaying summaries, and all positions in a layer process in parallel instead of one after another. The win was parallelizable scale, not a more elegant idea.

Q. What is an encoder-only model for, with an example?

Understanding tasks: classification, named-entity recognition, embeddings, semantic search. The whole input is visible to every layer. Canonical example: the BERT family (also RoBERTa, DistilBERT).

Q. What is a decoder-only model for, with an example?

Generation: producing one token at a time, each conditioned only on what came before. Canonical example: the GPT family, plus Llama and Mistral. Most chat assistants are decoder-only.

Q. What is an encoder-decoder model for, with an example?

Sequence-to-sequence tasks where the output is a related-but-different sequence: translation, summarization. An encoder reads the input fully; a decoder generates while attending back into it. Canonical example: T5 and BART.

Q. What is the first sorting question when picking a model?

Which of the three shapes does the task want: encoder-only for understanding, decoder-only for generation, encoder-decoder for sequence-to-sequence? Many bad results come from reaching for the wrong shape.

Q. What is pre-training, and how expensive is it?

Training a model on a generic objective over an enormous corpus to produce a broadly capable base model. The expensive part: weeks to months on large clusters, headline budgets. You will almost never do it yourself.

Q. What is fine-tuning, and how does it compare in cost?

Continued training of a pre-trained model on a small task-specific dataset to shape its behavior. The cheap part: often hours on a single machine. It turns a base model into a chat assistant, a classifier, a domain summarizer.

Q. Name the four limits to carry into every transformer tool.

Bias passes through (it reflects its training data), hallucination is unavoidable in current generative models, context length is finite, and reasoning is more pattern recognition than deduction. Properties of the technology, not bugs.

Q. What is Hugging Face, in one line?

A platform (the Hub of models, datasets, and Spaces at huggingface.co) plus open-source libraries (transformers, datasets, tokenizers, accelerate) that make those models usable in a few lines of Python.