
Cheatsheet: What transformers do, and why they took over AI

Tokens in, tokens out, attention layers in the middle. That sentence covers almost every modern AI system you have used for language. The same architecture family underlies chat assistants, summarizers, embedding models, code completers, and multimodal models.
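To make "tokens in" concrete, here is a minimal sketch of a greedy longest-match subword tokenizer, loosely in the style of WordPiece. The vocabulary `VOCAB`, the `##` continuation prefix, and the function names are assumptions chosen for illustration; real tokenizers learn their vocabulary from data and handle punctuation, casing, and unknown characters far more carefully.

```python
# Toy greedy longest-match subword tokenizer (WordPiece-style).
# VOCAB is a hypothetical, hand-picked vocabulary, not taken from any real model.
VOCAB = {"token", "##s", "in", "out", "trans", "##form", "##ers", "[UNK]"}


def tokenize_word(word, vocab=VOCAB):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab=VOCAB):
    """Lowercase, split on whitespace, then break each word into subword tokens."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens


print(tokenize("Tokens in tokens out"))
print(tokenize("transformers"))
```

With this toy vocabulary, "in" stays a whole-word token while "transformers" is split into three fragments, which is the whole-word versus fragment distinction in miniature.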

| Problem with sequential models | How the transformer solves it |
| --- | --- |
| Long-range connections decay through compressed hidden states | Every token attends directly to every other token in one step |
| Cannot parallelize across positions in a sequence | All positions process in parallel inside a layer; hardware utilization jumps |

Not a more elegant idea, a more parallelizable one. That is the practical reason it won.
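The "every token attends to every other token" idea can be sketched in a few lines. This is a deliberately simplified illustration, assuming queries, keys, and values are all the raw input vectors (a real layer applies learned projections and uses multiple heads), and the function names are made up for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors.

    Assumption for illustration: no learned projection matrices, single head.
    Returns the mixed output vectors and the attention weights per position.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # One score per position: every token is compared with every token.
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each iteration of the outer loop depends only on the input, not on the result for any other position. That independence is what lets a real implementation replace the loop with one matrix multiplication over all positions at once.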

| Shape | Built for | Canonical examples |
| --- | --- | --- |
| Encoder-only | Understanding tasks (classification, NER, embeddings, search) | BERT, RoBERTa, DistilBERT |
| Decoder-only | Generation (chat, completion, writing) | GPT family, Llama, Mistral |
| Encoder-decoder | Sequence-to-sequence (translation, summarization) | T5, BART |

When choosing a model for a real task, this is usually the first sorting question. Many bad results come from reaching for the wrong shape.
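One way to see the difference between the shapes is the attention mask: which positions each token is allowed to look at. The sketch below is an illustration only, with a made-up helper `attention_mask`. It shows the bidirectional pattern typical of encoders and the causal (left-to-right) pattern typical of decoders. An encoder-decoder model combines both, with the decoder additionally attending to the encoder's output.

```python
def attention_mask(n, causal):
    """Return an n x n grid; mask[i][j] is True when position i may attend to j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as text: 'x' means allowed, '.' means blocked."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


encoder_mask = attention_mask(4, causal=False)  # every token sees every token
decoder_mask = attention_mask(4, causal=True)   # each token sees itself and earlier tokens

print("Encoder-style (bidirectional):")
print(show(encoder_mask))
print("Decoder-style (causal):")
print(show(decoder_mask))
```

The causal mask is what makes a decoder suited to generation: a token's representation never depends on tokens that have not been produced yet.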

| Year | Milestone |
| --- | --- |
| 2017 | “Attention Is All You Need” introduces the architecture |
| 2018 | BERT (encoder-only) and GPT-1 (decoder-only) split the field into two branches |
| 2019 | GPT-2 shows scale produces surprising capability |
| 2020 | GPT-3 pushes scale two orders of magnitude further |
| Late 2022 | ChatGPT makes a transformer chat interface mainstream |
| 2023 onward | Open-weight stream (Llama, Mistral) and frontier proprietary stream (Claude, Gemini, GPT families) running in parallel |

Talk in families and trends, not version numbers. Version numbers go stale within months.

| | Pre-training | Fine-tuning |
| --- | --- | --- |
| Goal | Generic capability from a huge corpus | Shape a base model for a task or style |
| Compute | Weeks to months on large clusters, headline budgets | Hours on a single machine in many cases |
| Frequency for practical users | Almost never | Sometimes |
| Most common practical use | Just load a pre-trained model and use it | Adapt a pre-trained model when prompting alone is not enough |

The asymmetry matters. The Hugging Face ecosystem is built around the “just load and use” case being easy.

  • Bias passes through. A model reflects its training data. English-heavy data, English-heavy strengths and slants.
  • Hallucination is unavoidable in current generative models. Fluent and confident does not mean correct, and the model has no reliable internal signal that distinguishes the two.
  • Context length is finite. Behavior on very long inputs is often worse than on short ones, even within the advertised window.
  • Reasoning is more pattern recognition than deduction. A convincing chain of thought can still arrive at a wrong answer.

These are properties of the technology, not bugs to be patched in the next release.

| Piece | What it is |
| --- | --- |
| Hugging Face Hub | The site at huggingface.co hosting models, datasets, and Spaces, with model cards describing training data, intended use, limits, and license |
| transformers | Python library to load models and tokenizers with a uniform API regardless of architecture |
| datasets | Data loading and processing library used across the ecosystem |
| tokenizers | Fast tokenization layer |
| accelerate | Same training code runs on laptop CPU, single GPU, or multi-GPU without rewriting |

Track 14 will teach you to use these in practice. You can load a strong open-weight model in about three lines of Python.
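As a rough sketch of what "load and use" looks like, the function below wraps the `pipeline` helper from the `transformers` library. The model id `gpt2` is only a small placeholder chosen so the example downloads quickly; it is an assumption, not a recommendation, and any text-generation model id from the Hub could be substituted. The wrapper name `generate` and its defaults are also made up for this example. Running it requires `transformers`, a backend such as PyTorch, and network access on first use.

```python
def generate(prompt, model_id="gpt2", max_new_tokens=20):
    """Load a text-generation pipeline from the Hub and continue the prompt.

    model_id is a placeholder assumption; swap in the model you actually want.
    """
    # Imported here so that defining the function does not require the library.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    result = generator(prompt, max_new_tokens=max_new_tokens)
    # The pipeline returns a list with one dict per generated sequence.
    return result[0]["generated_text"]


# Example call (downloads the model weights the first time it runs):
# print(generate("Transformers took over AI because"))
```

Stripped of the wrapper, the core is the import, the `pipeline(...)` call, and the call to the resulting object, which is where the "about three lines" figure comes from.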

  • Token: the unit a transformer actually processes. Sometimes a whole word, sometimes a fragment of one.
  • Pre-training: training a model on a generic objective over a large corpus. The expensive part.
  • Fine-tuning: continued training on a small task-specific dataset to shape behavior. The cheap part.
  • Base model: the artifact at the end of pre-training. Broadly capable, not yet instruction-shaped.
  • Encoder-only, decoder-only, encoder-decoder: the three architectural shapes. They differ in which direction attention can look and which sequence is being attended to.
  • Hub: shorthand for the Hugging Face model and dataset site at huggingface.co.
  • Hugging Face LLM Course, Chapter 1: “Transformer models.” huggingface.co/learn/llm-course/chapter1. Released under Apache 2.0; this lesson mirrors its structure with original prose.