What transformers do: a brief overview

What you’ll learn

This is the first lesson of Track 14, a hands-on track about using transformers productively through the Hugging Face ecosystem. Before you run anything, you need a working picture of what a transformer is and why it matters, and that is this lesson’s whole job. The source curriculum is the Hugging Face LLM Course, Chapter 1, freely available and Apache-2.0 licensed at huggingface.co/learn/llm-course.

You will get the one-sentence working description (tokens in, tokens out, attention in the middle); see why transformers replaced the recurrent models that came before; learn to sort the three architectural shapes (encoder-only, decoder-only, encoder-decoder) and which tasks each fits; walk a short timeline from the 2017 paper to today’s split between open-weight and frontier proprietary models; separate the expensive pre-training step from cheap fine-tuning; meet the limits you have to design around; and place the Hugging Face platform and libraries that the rest of the track depends on. None of it requires the math.

Where this fits

This is lesson 1 of 12, opening Phase 1 (the Transformers library). It is the conceptual on-ramp: the next lesson stops describing transformers and starts running them with the pipeline() function. Track 5 (Transformers and LLMs) teaches the mechanics under the hood (queries, keys, values, attention math); this track deliberately skips that and teaches you to use the models. If you have done Track 5, this lesson is a fast working summary; if you have not, it is self-contained.

Before you start

Prerequisites: none within the track; this is the opener. You will get more out of it if you are comfortable with the idea of a neural network as a function that learns from data. Track 14 as a whole assumes you can read and write basic Python (the later lessons walk through real, runnable code), but this first lesson is conceptual and reads cleanly without any code in front of you.

About the math

None. This lesson is entirely conceptual. It names attention, pre-training, and fine-tuning and tells you what they do, but it shows no equations. The math of attention lives in Track 5 for readers who want it; productive applied work, the focus of this track, mostly happens without it.

By the end, you’ll be able to

The core capability this lesson builds: explaining at a working level what a transformer does (tokens in, tokens out) and distinguishing the three architectural shapes, without the math. Concretely, you will be able to:

Describe in plain working terms what a transformer does end to end (tokens in, tokens out, attention in the middle)
Explain why transformers replaced RNNs and LSTMs (direct long-range connections plus parallelism)
Distinguish the three architectural shapes (encoder-only, decoder-only, encoder-decoder) and pick the right one for a task
Distinguish pre-training (the expensive part) from fine-tuning (the cheap part) and place a model accordingly
Name the main limits of current transformers, and describe what the Hugging Face platform and libraries provide

Time and difficulty

Read time: about 11 minutes
Practice time: about 10 minutes (a sort-the-task-to-a-shape exercise and flashcards)
Difficulty: standard (conceptual, no math and no code in this opening lesson)