What transformers do: cheatsheet

The working description

Tokens in, tokens out, attention layers in the middle. That sentence covers almost every modern AI system you have used for language. The same architectural family underlies chat assistants, summarizers, embedding models, code completers, and multimodal models.

Why transformers replaced RNNs and LSTMs

Problem with sequential models	How the transformer solves it
Long-range connections decay through compressed hidden states	Every token attends directly to every other token in one step
Cannot parallelize across positions in a sequence	All positions process in parallel inside a layer; hardware utilization jumps

Not a more elegant idea, a more parallelizable one. That is the practical reason it won.
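A minimal sketch of the "every token attends to every other token" row, in plain Python. It assumes single-head scaled dot-product attention with no learned projections, so each vector acts as its own query, key, and value. The function names and the three toy embeddings are illustrative, not taken from any library.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product attention without learned projections.

    Assumption: each vector is used directly as query, key, and value.
    A real layer would first multiply by learned weight matrices.
    """
    dim = len(vectors[0])
    weights = []
    outputs = []
    for query in vectors:
        # Score this position against every position, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # The output is the weighted average of all value vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(dim)]
        )
    return weights, outputs


# Hypothetical 2-dimensional embeddings for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each pass through the outer loop depends only on the input vectors, never on another position's result. That independence is what lets hardware compute all positions at once, where a recurrent model has to wait for the previous step.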

The three architectural shapes

Shape	Built for	Canonical examples
Encoder-only	Understanding tasks (classification, NER, embeddings, search)	BERT, RoBERTa, DistilBERT
Decoder-only	Generation (chat, completion, writing)	GPT family, Llama, Mistral
Encoder-decoder	Sequence-to-sequence (translation, summarization)	T5, BART

When choosing a model for a real task, this is usually the first sorting question. Many bad results come from reaching for the wrong shape.
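One way to see the difference between the shapes is the attention mask. The sketch below assumes the usual convention: encoder-style attention is bidirectional, decoder-style attention is causal (a position sees itself and earlier positions only). The helper names are hypothetical.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid; True means row i may attend to column j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


def show(mask):
    """Render a mask as text: 'x' for allowed, '.' for blocked."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


encoder_mask = attention_mask(4, causal=False)  # bidirectional
decoder_mask = attention_mask(4, causal=True)   # left-to-right only

print("encoder-style:\n" + show(encoder_mask))
print("decoder-style:\n" + show(decoder_mask))
```

The encoder grid is fully filled; the decoder grid is lower-triangular. An encoder-decoder model uses both: a bidirectional encoder over the input, and a causal decoder that also attends to the encoder's output.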

The timeline at a glance

Year	Milestone
2017	“Attention Is All You Need” introduces the architecture
2018	BERT (encoder-only) and GPT-1 (decoder-only) split the field into two branches
2019	GPT-2 shows scale produces surprising capability
2020	GPT-3 pushes scale two orders of magnitude further
Late 2022	ChatGPT makes a transformer chat interface mainstream
2023 onward	Open-weight stream (Llama, Mistral) and frontier proprietary stream (Claude, Gemini, GPT families) running in parallel

Talk in families and trends, not version numbers. Version numbers go stale within months.

Pre-training versus fine-tuning

	Pre-training	Fine-tuning
Goal	Generic capability from a huge corpus	Shape a base model for a task or style
Compute	Weeks to months on large clusters, headline budgets	Hours on a single machine in many cases
Frequency for practical users	Almost never	Sometimes
Most common practical use	Just load a pre-trained model and use it	Adapt a pre-trained model when prompting alone is not enough

The asymmetry matters. The Hugging Face ecosystem is built around the “just load and use” case being easy.

Limits to remember

Bias passes through. A model reflects its training data: English-heavy data produces English-heavy strengths and slants.
Hallucination is unavoidable in current generative models. Fluent and confident does not mean correct, and the model has no reliable internal signal that distinguishes the two.
Context length is finite. Behavior on very long inputs is often worse than on short ones, even within the advertised window.
Reasoning is more pattern recognition than deduction. A convincing chain of thought can still arrive at a wrong answer.

These are properties of the technology, not bugs to be patched in the next release.

The Hugging Face ecosystem

Piece	What it is
Hugging Face Hub	The site at `huggingface.co` hosting models, datasets, and Spaces, with model cards describing training data, intended use, limits, and license
`transformers`	Python library to load models and tokenizers with a uniform API regardless of architecture
`datasets`	Data loading and processing library used across the ecosystem
`tokenizers`	Fast tokenization layer
`accelerate`	Library that lets the same training code run on a laptop CPU, a single GPU, or multiple GPUs without rewriting

Track 14 will teach you to use these in practice. You can load a strong open-weight model in about three lines of Python.
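A hedged sketch of the load-and-use case with the `transformers` `pipeline` helper. The checkpoint name `gpt2` is an assumption, picked only because it is small and quick to download; swap in whichever open-weight model suits the task. Running it requires `transformers` plus a backend such as PyTorch, and network access on first use.

```python
# Assumption: a small open-weight checkpoint, chosen only for a quick download.
MODEL_ID = "gpt2"


def generate(prompt, max_new_tokens=20):
    """Load a text-generation pipeline and return one completion for `prompt`."""
    # Imported here so this file still loads where `transformers` is not installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    result = generator(prompt, max_new_tokens=max_new_tokens)
    return result[0]["generated_text"]
```

The three lines inside `generate` are the whole workflow: import, load, call. Calling `generate("Transformers replaced recurrent networks because")` downloads the weights from the Hub the first time and reuses the local cache afterwards.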

Words to use precisely

Token: the unit a transformer actually processes. Sometimes a whole word, sometimes a fragment of one.
Pre-training: training a model on a generic objective over a large corpus. The expensive part.
Fine-tuning: continued training on a small task-specific dataset to shape behavior. The cheap part.
Base model: the artifact at the end of pre-training. Broadly capable, not yet instruction-shaped.
Encoder, decoder, encoder-decoder: the three architectural shapes. Distinguished by which direction the attention can look and which sequence is being attended to.
Hub: shorthand for the Hugging Face model and dataset site at `huggingface.co`.
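To make the "Token" entry concrete, here is a toy sub-word splitter. It assumes a tiny hand-written vocabulary and greedy longest-match splitting; real tokenizers learn their vocabulary from a corpus and differ in detail, so treat every name and vocabulary entry below as hypothetical.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from a corpus.
VOCAB = {"the", "token", "ization", "izer", "s", "un", "believ", "able"}


def tokenize_word(word, vocab=VOCAB, unknown="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [unknown]  # no vocabulary piece matches at this position
        pieces.append(word[start:end])
        start = end
    return pieces


for word in ["the", "tokenization", "unbelievable", "tokenizers", "zebra"]:
    print(word, "->", tokenize_word(word))
```

A common word such as "the" stays whole, a longer word such as "tokenization" splits into fragments, and a word the vocabulary cannot cover falls back to an unknown marker.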

Recommended further study

Hugging Face LLM Course, Chapter 1: “Transformer models.” huggingface.co/learn/llm-course/chapter1. Released under Apache 2.0; this lesson mirrors its structure with original prose.