Cheatsheet: What transformers do, and why they took over AI
The working description
Tokens in, tokens out, attention layers in the middle. That sentence covers almost every modern AI system you have used for language. The architecture is the same family across chat assistants, summarizers, embedding models, code completers, and multimodal models.
Why transformers replaced RNNs and LSTMs
| Problem with sequential models | How the transformer solves it |
|---|---|
| Long-range connections decay through compressed hidden states | Every token attends directly to every other token in one step |
| Cannot parallelize across positions in a sequence | All positions process in parallel inside a layer; hardware utilization jumps |
Not a more elegant idea, a more parallelizable one. That is the practical reason it won.
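To make "every token attends directly to every other token" concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It assumes the token vectors serve directly as queries, keys, and values; real transformer layers first pass them through learned projections, and the numbers below are made up for illustration.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over a whole sequence.

    Each argument is a list of vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how much token i
    attends to token j.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    # Each pass through this loop is independent of the others, which is
    # why real implementations compute all rows at once as a matrix product.
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, values)) for d in range(len(values[0]))]
        )
    return outputs, weights


# Three toy token vectors (illustrative values, not real embeddings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Every row of `weights` has a nonzero entry for every position, so the first token reaches the last in a single step rather than through a chain of hidden states.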
The three architectural shapes
| Shape | Built for | Canonical examples |
|---|---|---|
| Encoder-only | Understanding tasks (classification, NER, embeddings, search) | BERT, RoBERTa, DistilBERT |
| Decoder-only | Generation (chat, completion, writing) | GPT family, Llama, Mistral |
| Encoder-decoder | Sequence-to-sequence (translation, summarization) | T5, BART |
When choosing a model for a real task, this is usually the first sorting question. Many bad results come from reaching for the wrong shape.
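One way to see the difference between the shapes is the attention mask. The sketch below assumes the usual convention: encoder-style attention lets each position look at every position, while decoder-style (causal) attention lets each position look only at itself and earlier positions. The helper `attention_mask` is a hypothetical name used for illustration, not a library function.

```python
def attention_mask(seq_len, causal):
    """Return mask[i][j] == True where position i may attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


encoder_mask = attention_mask(4, causal=False)  # bidirectional: all True
decoder_mask = attention_mask(4, causal=True)   # causal: lower triangle only

# Draw the causal mask: "x" means "may attend", "." means "blocked".
for row in decoder_mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

An encoder-decoder model combines both: its encoder uses the bidirectional pattern over the input, and its decoder uses the causal pattern over the output while also attending to the encoder's result.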
The timeline at a glance
| Year | Milestone |
|---|---|
| 2017 | “Attention Is All You Need” introduces the architecture |
| 2018 | BERT (encoder-only) and GPT-1 (decoder-only) split the field into two branches |
| 2019 | GPT-2 shows scale produces surprising capability |
| 2020 | GPT-3 pushes scale two orders of magnitude further |
| Late 2022 | ChatGPT makes a transformer chat interface mainstream |
| 2023 onward | Open-weight stream (Llama, Mistral) and frontier proprietary stream (Claude, Gemini, GPT families) running in parallel |
Talk in families and trends, not version numbers. Version numbers go stale within months.
Pre-training versus fine-tuning
| | Pre-training | Fine-tuning |
|---|---|---|
| Goal | Generic capability from a huge corpus | Shape a base model for a task or style |
| Compute | Weeks to months on large clusters, at budgets large enough to make headlines | Hours on a single machine in many cases |
| Frequency for practical users | Almost never | Sometimes |
| Most common practical use | Just load a pre-trained model and use it | Adapt a pre-trained model when prompting alone is not enough |
The asymmetry matters. The Hugging Face ecosystem is built around the “just load and use” case being easy.
Limits to remember
- Bias passes through. A model reflects its training data. English-heavy data, English-heavy strengths and slants.
- Hallucination is unavoidable in current generative models. Fluent and confident does not mean correct, and the model has no reliable internal signal that distinguishes the two.
- Context length is finite. Behavior on very long inputs is often worse than on short ones, even within the advertised window.
- Reasoning is more pattern recognition than deduction. A convincing chain of thought can still arrive at a wrong answer.
These are properties of the technology, not bugs to be patched in the next release.
The Hugging Face ecosystem
| Piece | What it is |
|---|---|
| Hugging Face Hub | The site at huggingface.co hosting models, datasets, and Spaces, with model cards describing training data, intended use, limits, and license |
| `transformers` | Python library to load models and tokenizers with a uniform API regardless of architecture |
| `datasets` | Data loading and processing library used across the ecosystem |
| `tokenizers` | Fast tokenization layer |
| `accelerate` | Same training code runs on laptop CPU, single GPU, or multi-GPU without rewriting |
Track 14 will teach you to use these in practice. You can load a strong open-weight model in about three lines of Python.
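As a rough preview of the "load and use" case, the sketch below wraps the `pipeline` helper from `transformers`. It assumes `transformers` and a backend such as PyTorch are installed and that the machine can reach the Hub. The wrapper name `build_generator` is hypothetical, and `distilgpt2` is chosen here only because it is small and quick to download, not because it is a strong model.

```python
def build_generator(model_id):
    """Load a text-generation pipeline for a model hosted on the Hub.

    The import is done lazily so this file can be read and imported
    without transformers installed.
    """
    from transformers import pipeline  # needs `pip install transformers` plus a backend

    return pipeline("text-generation", model=model_id)


# Usage (downloads the weights on first call, so it is left commented out):
# generator = build_generator("distilgpt2")
# result = generator("Transformers took over because", max_new_tokens=20)
# print(result[0]["generated_text"])
```

Swapping in a larger open-weight model is a matter of changing the model identifier, subject to its license and your hardware.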
Words to use precisely
- Token: the unit a transformer actually processes. Sometimes a whole word, sometimes a fragment of one.
- Pre-training: training a model on a generic objective over a large corpus. The expensive part.
- Fine-tuning: continued training on a small task-specific dataset to shape behavior. The cheap part.
- Base model: the artifact at the end of pre-training. Broadly capable, not yet instruction-shaped.
- Encoder, decoder, encoder-decoder: the three architectural shapes. Distinguished by which direction the attention can look and which sequence is being attended to.
- Hub: shorthand for the Hugging Face model and dataset site at huggingface.co.
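To illustrate the first definition, that a token is sometimes a whole word and sometimes a fragment, here is a toy greedy longest-match splitter. The vocabulary is invented for this example, and the algorithm is a simplification: production tokenizers learn their vocabularies from data with methods such as BPE or WordPiece.

```python
# Hypothetical vocabulary for illustration only; real vocabularies
# typically hold tens of thousands of entries.
VOCAB = {"the", "transform", "er", "s", "un", "believ", "able"}


def tokenize_word(word, vocab=VOCAB):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(tokenize_word("the"))           # ['the']
print(tokenize_word("transformers"))  # ['transform', 'er', 's']
print(tokenize_word("unbelievable"))  # ['un', 'believ', 'able']
```

A common word survives as one token, while rarer words are assembled from reusable fragments, which is how a fixed vocabulary can cover text it has never seen.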
Recommended further study
- Hugging Face LLM Course, Chapter 1: “Transformer models.” huggingface.co/learn/llm-course/chapter1. Released under Apache 2.0; this lesson mirrors its structure with original prose.