Transformers beyond text: brief

What you’ll learn

This is lesson 4 of Phase 7 (How we judge models and where they’re going) in Track 5 (AI Foundations). The transformer block was originally designed for machine translation (Phase 2 covered the original architecture). The interesting part of its story is what happened next: the same block, with minor adaptations, became useful for many tasks that have nothing to do with text.

This lesson covers the two canonical examples. Vision Transformers (ViT) apply the transformer architecture to image patches instead of text tokens, providing the architectural foundation for modern multimodal AI. Mixture-of-Experts (MoE) keeps the transformer architecture but replaces the feed-forward network with a set of experts, only a few of which are activated for each token. This lets frontier models scale parameter counts dramatically without proportionally scaling per-token compute.

The lesson also names other transformer adaptations (diffusion models, speech, recommendation, and diffusion-based LLMs) without going deep on them, and surfaces the broader pattern: the transformer block is now a general-purpose neural-network primitive, not just a language-model component. Course materials are at cme295.stanford.edu.
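To make the "image patches instead of text tokens" idea concrete, here is a minimal sketch in plain Python of the patching step only. The function name `patchify` and the toy 8x8 image are illustrative assumptions, not part of the course materials. A real ViT would additionally project each flattened patch to the model dimension with a learned linear layer and add position embeddings, both of which are omitted here.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches.

    Patches are returned in row-major order (left to right, top to bottom).
    Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append this pixel's channels
            patches.append(flat)
    return patches


# Toy 8x8 "RGB" image: each pixel stores [row, col, 0] so patches are easy to inspect.
image = [[[r, c, 0] for c in range(8)] for r in range(8)]

patches = patchify(image, patch=4)
num_tokens = len(patches) + 1  # one extra slot for the CLS token
```

For this toy input the image becomes 4 patches of 48 values each, so the encoder would see a sequence of 5 tokens including CLS. By the same arithmetic, a 224x224 image cut into 16x16 patches gives 196 patch tokens.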

Where this fits

The previous lesson (Why tool-using models fail) closed the evaluation arc with a tool-failure taxonomy. This lesson opens the frontier-direction arc by showing how the transformer block has been adapted beyond text. The next lesson (New ways to generate) covers two generation-time alternatives to standard autoregressive decoding: speculative decoding and diffusion language models. The track closes with a safety-lens recap that pulls together every safety thread woven through Phases 4 to 7.

Before you start

Prerequisites: the transformer block lesson is required, because this lesson assumes you know what the feed-forward network (FFN) layer does inside a transformer block (MoE swaps that FFN for an expert-routing layer). The BERT architecture lesson is useful, because ViT borrows the CLS-token convention from BERT.
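As a preview of what "swapping the FFN for an expert-routing layer" means, here is a toy sketch of top-k routing for a single token. Everything in it is an assumption made for illustration: the names `moe_layer`, `router_weights`, and `experts` are hypothetical, the experts are simple scaling functions rather than real feed-forward networks, and normalizing the gate weights with a softmax over only the selected experts is one common choice, not the only one.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs.

    router_weights holds one row per expert; the dot product of a row with
    the token gives that expert's routing score (logit).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Indices of the top_k highest-scoring experts, best first.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(token)
    for gate, index in zip(gates, chosen):
        expert_out = experts[index](token)  # only the chosen experts ever run
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output, chosen


# Toy experts: each one just scales the token by a fixed factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]

output, chosen = moe_layer([1.0, 0.0], router_weights, experts, top_k=2)
```

With four experts and `top_k=2`, two of the four expert functions are never called for this token, which is the source of the compute savings: the unused experts still count toward the parameter total but cost nothing at inference time for that token.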

By the end, you’ll be able to

Explain what a Vision Transformer (ViT) does (image to patches, patches to tokens, run through encoder, classify via CLS)
Explain what Mixture-of-Experts (MoE) does (sparse expert routing in the FFN layer, parameter count scales without proportional per-token compute)
Summarize in one sentence what ViT and MoE each enable
Recognize when a parameter-count claim refers to total vs active parameters in an MoE model
Recognize the broader pattern (the transformer block as a general-purpose neural-network primitive) and identify other adaptations (diffusion transformers, speech transformers, recommendation transformers)
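The total-versus-active distinction in the objectives above can be worked through with simple arithmetic. The sketch below is a rough estimate under stated assumptions: it counts only attention, router, and FFN weight matrices (no embeddings, biases, or layer norms), assumes a two-matrix FFN per expert, and uses a made-up configuration that does not correspond to any real model. The function names are hypothetical.

```python
def ffn_params(d_model, d_ff):
    """Weights in a two-matrix FFN: d_model -> d_ff -> d_model (no biases)."""
    return 2 * d_model * d_ff


def attn_params(d_model):
    """Weights in the Q, K, V, and output projections (no biases)."""
    return 4 * d_model * d_model


def moe_counts(n_layers, d_model, d_ff, n_experts, top_k):
    """Return (total, active) parameter estimates for an MoE transformer stack."""
    attn = attn_params(d_model)
    router = d_model * n_experts  # one routing score per expert
    expert = ffn_params(d_model, d_ff)
    total = n_layers * (attn + router + n_experts * expert)
    active = n_layers * (attn + router + top_k * expert)  # used per token
    return total, active


# Hypothetical configuration, chosen only to keep the numbers small.
config = dict(n_layers=4, d_model=512, d_ff=2048, n_experts=8, top_k=2)

total, active = moe_counts(**config)
dense = config["n_layers"] * (
    attn_params(config["d_model"]) + ffn_params(config["d_model"], config["d_ff"])
)

print(f"dense baseline: {dense:,}")
print(f"MoE total:      {total:,}")
print(f"MoE active:     {active:,}")
```

Under these assumptions the MoE stack stores about 71.3 million parameters but uses only about 21.0 million of them for any single token, compared with about 12.6 million for the dense baseline. When a model is advertised by parameter count, the same question applies: is the headline figure the stored total or the per-token active count?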

Time and difficulty

Read time: about 11 minutes
Practice time: about 12 minutes (a self-check on what each adaptation enables, a hands-on exercise comparing dense vs MoE parameter counts, and flashcards)
Difficulty: standard