Transformers beyond text: ViT and MoE

The transformer block was originally designed for machine translation. Phase 2 covered that original architecture in detail. The interesting part of the transformer’s story is what happened next: the same block, with minor adaptations, turned out to be useful for many things that have nothing to do with text.

This lesson covers two of those adaptations. Vision Transformers (ViT) take the transformer block and apply it to image patches instead of text tokens. The result is image classification and image understanding that competes with the convolutional networks the field had used for a decade. Mixture-of-Experts (MoE) keeps the transformer architecture but rewires the feed-forward network into a set of “experts” that fire only when needed. This lets frontier models grow their parameter counts sharply without growing the compute spent per token.

These are not the only adaptations. The lecturer mentions diffusion transformers, recommendation transformers, and speech transformers as further examples; you will hear those names if you read across AI research. Two are enough to make the broader point: the transformer block has been adapted in major directions, and ViT and MoE are the canonical examples to know.

This is lesson 4 of Phase 7, the second-to-last lesson on the path to closing the track. The next lesson covers new ways models generate (speculative decoding and diffusion language models). The closer pulls together the safety threads from Phases 4 through 7.

Vision Transformers (ViT)

The traditional approach for image tasks was the convolutional neural network (CNN). CNNs had a strong inductive bias built in. They processed images as 2D grids of pixels, looked at small local patches through convolutions, and then combined those patches through pooling and stacking. The architecture made strong assumptions about images: that a feature means the same thing wherever it appears, and that local features build into global ones. Those assumptions matched image data well enough that CNNs led image work for years.

The Vision Transformer paper (Dosovitskiy et al., 2021) tried almost the opposite. It did not build image-specific structure into the model. Instead it took the transformer block, which assumes very little about its inputs, and fed it image patches as if they were tokens.

Here is what that looks like end-to-end:

Split the image into patches. Take an image (say 224×224 pixels) and divide it into a grid of fixed-size non-overlapping patches (typically 16×16 pixels each, giving 196 patches for a 224×224 image).
Convert each patch to a vector. Each patch is a small grid of pixels; flatten it and project through a learned linear layer to get a single vector per patch. Now each patch is a “token” in the sequence sense.
Add a CLS token at the start. This is the same idea as BERT’s CLS token (Phase 2 lesson): a learned vector that gets to attend to all the patch tokens. With 196 patches, the sequence is now 197 tokens long.
Add position embeddings. Each patch needs to know where it sat in the original image. Position embeddings are added the same way they are in text transformers (Phase 1’s lesson is the original answer; modern ViTs sometimes use learned 2D position embeddings).
Run through a transformer encoder. The standard architecture. Self-attention lets every patch attend to every other patch, just as text-transformer attention lets every token attend to every other token.
Project the CLS token’s final embedding to a class. A small feed-forward network at the end maps the CLS embedding to a class label (cat, dog, teddy bear, etc.).
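The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the ViT paper's implementation: the function names, the parameter dictionary, the random weights, and the small `d_model=64` are all assumptions made for the example, and a single single-head attention layer with a residual connection stands in for the full encoder (which would also have multiple heads, layer normalization, FFN sublayers, and many stacked blocks).

```python
import numpy as np

def patchify(image, patch_size):
    """Step 1: split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention: every token attends to every other token."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def init_params(rng, patch_dim, d_model, num_tokens, num_classes):
    """Random stand-ins for weights that would normally be learned."""
    def w(*shape):
        return rng.normal(0.0, 0.02, shape)
    return {
        "w_patch": w(patch_dim, d_model),   # step 2: linear patch projection
        "cls": w(1, d_model),               # step 3: CLS token
        "pos": w(num_tokens, d_model),      # step 4: one position embedding per token
        "w_q": w(d_model, d_model),
        "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model),
        "w_head": w(d_model, num_classes),  # step 6: classification head
    }

def vit_forward(image, params, patch_size=16):
    patches = patchify(image, patch_size)        # (196, 768) for a 224x224x3 image
    tokens = patches @ params["w_patch"]         # (196, d_model)
    tokens = np.vstack([params["cls"], tokens])  # (197, d_model), CLS first
    tokens = tokens + params["pos"]
    # Step 5, heavily simplified: one attention layer with a residual connection.
    tokens = tokens + self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"]
    )
    return tokens[0] @ params["w_head"]          # class logits from the CLS position

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
params = init_params(rng, patch_dim=16 * 16 * 3, d_model=64,
                     num_tokens=197, num_classes=10)
logits = vit_forward(image, params)
print(patchify(image, 16).shape)  # (196, 768)
print(logits.shape)               # (10,)
```

Note that nothing in `vit_forward` knows the tokens came from an image. The only image-specific code is `patchify`, which is the point of the architecture.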

The Stanford lecturer frames why this works. With enough training data, the model learns the biases that CNNs had built in, rather than needing the architecture to impose them. The ViT paper’s headline result was simple. On small datasets, CNNs still won. On large datasets, ViT caught up and then passed them. So a built-in bias helps when data is scarce. With enough data, a more flexible, higher-capacity model can learn the bias on its own.

That point carried weight beyond image classification. Once ViT showed that the transformer block worked for non-text inputs, the field began applying transformers to other kinds of data. The first wave of multimodal models (vision-language, audio-text) usually paired a ViT-style encoder with a text-decoder LLM. A connector let the text model see the image tokens: cross-attention layers in some designs, a learned projection into the LLM’s input sequence in others. The lecturer mentions LLaVA, which uses the projection approach, as one popular example.

The 2026 frontier has moved one step further. Models like GPT-4o, GPT-5.x, Gemini 3.x, and the Llama 4 herd are described as natively multimodal or Omni architectures. Vision and audio are not bolted-on encoders feeding a text decoder. They are first-class inputs that the model trained on jointly from the start. The core building block is still the transformer block. What changed is that the modality encoders and the text core were trained as one model, rather than glued together after the fact. The user-visible effect is smoother cross-modal reasoning. The model can take in an image and a few seconds of audio in the same turn, with no handoff seams between them.

The mental model from this lesson still applies. Whether you are reading about a ViT-LLM cross-attention adapter (the bolted-on shape) or an Omni model (the native shape), an image goes through some transformer-block-shaped path, comes out as a sequence of tokens, and gets fed alongside text tokens for the model to work with. The difference is whether those tokens flow through one trained-together model or two glued-together models.
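To make "fed alongside text tokens" concrete, here is a minimal sketch of one way the bolted-on shape can join the two streams: project the image encoder's outputs to the LLM's embedding width, then concatenate them with the text embeddings. The function name, the dimensions, and the random weights are hypothetical, chosen only for illustration. Real systems differ in the details, and cross-attention designs do not concatenate at all.

```python
import numpy as np

def build_multimodal_sequence(image_features, text_embeddings, w_proj):
    """Project image-encoder outputs to the LLM width and prepend them to the text.

    image_features:  (num_patches, d_vision) output of a ViT-style encoder
    text_embeddings: (num_text_tokens, d_llm) embedded text tokens
    w_proj:          (d_vision, d_llm) learned projection (random here)
    """
    image_tokens = image_features @ w_proj  # (num_patches, d_llm)
    return np.concatenate([image_tokens, text_embeddings], axis=0)

rng = np.random.default_rng(0)
image_features = rng.normal(size=(196, 768))   # assumed encoder output
text_embeddings = rng.normal(size=(12, 1024))  # assumed 12-token prompt
w_proj = rng.normal(0.0, 0.02, size=(768, 1024))

sequence = build_multimodal_sequence(image_features, text_embeddings, w_proj)
print(sequence.shape)  # (208, 1024)
```

From the LLM's point of view, the result is just a longer sequence of vectors. Whether the projection was trained jointly with the core or attached afterwards is invisible at this level, which is why the same mental model covers both shapes.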

Mixture-of-Experts (MoE)

The other major adaptation is structural rather than modal. ViT changed what kind of input the transformer processes; MoE changes what happens inside the transformer block.

The motivation: as language models scaled up, the question became whether all those parameters need to be active for every token. A 100-billion-parameter dense model uses all 100 billion parameters for every forward pass on every input token. That is expensive. If you could route different inputs to different subsets of parameters, you could have more total parameters without proportionally more per-token compute.

That is what MoE does. The trick is in the feed-forward network layer of the transformer block (the FFN that comes after attention; covered in Phase 2’s transformer-block lesson). Instead of having one FFN, you have many “experts,” each of which is its own FFN with its own weights. A small gating network decides which experts to send each token to, typically choosing 2 of N experts.

So the per-token computation looks like:

Token enters the FFN layer with its representation from the attention sublayer.
The gating network produces a routing decision. “Send this token to expert 3 and expert 7.” The decision is per-token, not per-input.
Only those two experts are activated. The other N-2 experts sit idle for this token.
The two experts’ outputs are combined (typically via weighted sum, with the gate’s confidences as weights) and passed up.
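The per-token steps above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not any particular model's implementation: the function names and sizes are invented for the example, each expert is a plain two-layer ReLU network, and the gate's weights are a softmax over only the chosen experts' scores (one common choice; other designs normalize over all experts before picking the top ones).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def expert_ffn(x, w_in, w_out):
    """One expert: a two-layer feed-forward network with a ReLU in between."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def moe_layer(tokens, w_gate, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    tokens:  (T, d) token representations from the attention sublayer
    w_gate:  (d, N) gating network weights, one score column per expert
    experts: list of N (w_in, w_out) weight pairs
    Returns the (T, d) output and the (T, top_k) chosen expert indices.
    """
    logits = tokens @ w_gate                          # (T, N) gate scores
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # top_k expert ids per token
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        idx = chosen[t]
        weights = softmax(logits[t, idx])             # confidences over chosen experts
        for weight, e in zip(weights, idx):
            w_in, w_out = experts[e]
            # Only the chosen experts run; the other N - top_k are never touched.
            out[t] += weight * expert_ffn(tokens[t], w_in, w_out)
    return out, chosen

rng = np.random.default_rng(0)
d_model, d_hidden, num_experts, num_tokens = 8, 16, 8, 5
experts = [
    (rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model)))
    for _ in range(num_experts)
]
w_gate = rng.normal(size=(d_model, num_experts))
tokens = rng.normal(size=(num_tokens, d_model))

out, chosen = moe_layer(tokens, w_gate, experts)
print(out.shape, chosen.shape)  # (5, 8) (5, 2)
```

The explicit Python loop is for readability. Production systems batch all tokens bound for the same expert and run each expert once per batch.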

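Total parameters grow with N, the number of experts. Per-token compute stays roughly flat, since only 2 experts run each time. The gain is large. Take a 200-billion-parameter MoE model with 8 experts and 2-of-8 routing. It carries 4 times the total parameters of a 50-billion-parameter dense model, yet costs about the same per token. That 4-to-1 figure is a simplification that treats every parameter as belonging to an expert. Attention and embedding weights are shared by all tokens, so real total-to-active ratios come out somewhat lower.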
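The arithmetic behind total versus active parameters is short enough to write down. The helper below is a hypothetical back-of-envelope calculator, not a formula from the lecture: it lumps all non-expert weights (attention, embeddings) into one `shared_params` figure and assumes every expert is the same size.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for a simplified MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

B = 10**9  # one billion

# The idealized case: every parameter lives in an expert.
total, active = moe_param_counts(0, 25 * B, num_experts=8, top_k=2)
print(total // B, active // B)  # 200 50

# Same 200B total, but with an assumed 20B of shared (non-expert) weights.
total, active = moe_param_counts(20 * B, 22_500_000_000, num_experts=8, top_k=2)
print(total // B, active // B)  # 200 65
```

In the second case the total-to-active ratio drops from 4 to about 3, because the shared weights run for every token no matter how routing goes.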

The Stanford lecturer flags one practical detail. Routing happens at the token level, not the input level. Different tokens in the same prompt can go to different experts. This lets you place experts on different GPUs and run them in parallel, with each GPU processing only the tokens routed to its experts. That is why MoE is easier to train at scale than the raw parameter count would suggest.
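One way to picture token-level routing is as a regrouping step: given each token's chosen experts, collect the token indices bound for each expert so that every expert (possibly on its own GPU) gets one batch. The sketch below is an assumption-laden illustration with a hypothetical function name and a hand-written routing table. It shows only the bookkeeping, not the cross-device communication a real system needs.

```python
import numpy as np

def group_tokens_by_expert(chosen, num_experts):
    """Map each expert id to the indices of the tokens routed to it.

    chosen: (T, k) integer array, the k expert ids picked for each of T tokens.
    """
    return {
        e: np.nonzero((chosen == e).any(axis=1))[0]
        for e in range(num_experts)
    }

# Three tokens from the same prompt, each routed to 2 of 8 experts.
chosen = np.array([
    [3, 7],  # token 0 -> experts 3 and 7
    [0, 3],  # token 1 -> experts 0 and 3
    [7, 1],  # token 2 -> experts 7 and 1
])
groups = group_tokens_by_expert(chosen, num_experts=8)
print(groups[3].tolist())  # [0, 1]
print(groups[7].tolist())  # [0, 2]
print(groups[2].tolist())  # []
```

Experts with an empty group do no work for this batch. Uneven group sizes are the load-balancing problem that MoE training recipes have to manage.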

Several frontier-grade open-source models are MoE: Mixtral, DeepSeek-V3, GPT-OSS, GLM 4. Closed-source frontier models often are too, though vendor disclosure varies. When you read about a “1-trillion-parameter model,” there is a good chance it is MoE; the active-parameters-per-token number is much smaller (maybe 200 billion), which is why the model runs at all.

What ViT and MoE each enable, in one sentence

ViT enables the transformer architecture to process non-text modalities (images, video frames, audio spectrograms, almost anything that can be tokenized into patches). It is the architectural foundation for modern multimodal models.
MoE enables scaling total parameter count dramatically without proportionally scaling per-token compute. It is the architectural reason frontier models can have parameter counts that would otherwise be unaffordable to run.

Different things; same underlying point. The transformer block has been adapted in major directions, and these two are the canonical examples of how that adaptation works.

A handful of other adaptations worth naming

Out of scope for this lesson but worth knowing exist:

Diffusion transformers apply transformer self-attention inside a denoising diffusion process for image generation. Several modern image-generation models (Stable Diffusion 3 and Flux, for example) use transformer-based diffusion architectures.
Speech transformers process audio (often via mel-spectrograms tokenized into patches, ViT-style) for speech recognition and synthesis. Whisper is the canonical example.
Recommendation transformers apply self-attention over user behavior sequences for ranking and personalization tasks. Common in production at large platforms.
Diffusion-based LLMs (the next lesson covers these): instead of generating text token-by-token, generate text by denoising. Active research area.

The pattern is the same in each case: the transformer block, with minor adaptations to how inputs are tokenized and how outputs are decoded. The block itself is doing the work.

Why this matters when you use AI

Three things to hold onto.

The transformer is now a general-purpose neural-network primitive, not just a language-model component. When you read about a “vision model,” “audio model,” or “multimodal system,” the underlying architecture is almost certainly a transformer with modality-specific adaptations. Knowing this means a single mental model (Phase 2’s transformer block) carries you across most of modern AI research, not just LLMs.
MoE is why frontier models can have massive parameter counts. When you see “X-billion-parameter model” in a press release, ask whether it is dense or MoE. The two have very different cost profiles. Take a 100-billion-parameter dense model and a 100-billion-parameter MoE model with 2-of-8 routing. The MoE one needs roughly a quarter of the compute per token, so it is much cheaper to run. The active-parameter count is the comparison that matters.
Multimodal AI is closer to “more transformers” than to “different architecture.” The shift from text-only to vision-language to general multimodal is largely a story of plugging more ViT-style encoders into the same LLM core. The interesting innovations are increasingly in how modalities interact (cross-attention patterns, token-mixing strategies) rather than in the underlying transformer block.

Common pitfalls

Three mistakes worth dodging.

Treating MoE parameter counts as comparable to dense parameter counts. A 200-billion-parameter MoE model with 2-of-8 routing has roughly the per-token compute of a 50-billion-parameter dense model. The “200 billion” number is the total parameter count, not the active count. Comparing this to a 200-billion-parameter dense model is not apples-to-apples on speed, even if it is on capability.

Assuming ViT replaced CNNs for everything. It did not. CNNs are still competitive on small datasets and on tasks where their inductive biases (locality, translation equivariance) match the data. For frontier-scale image and multimodal work, ViT-style encoders dominate; for embedded-vision applications and constrained-data tasks, CNNs are often still the better tool.

Overinterpreting “transformer for X” announcements. When a paper claims “we built a transformer-based system for [novel task],” the underlying transformer block is almost always close to the standard architecture. The novelty is usually in input tokenization, output decoding, or training data, not in the block itself. Reading these claims carefully helps separate genuine architectural innovation from “we used a known architecture for a new task.”

What you should remember

The transformer block has been adapted in major directions beyond text. ViT and MoE are the two canonical examples; many others exist.
Vision Transformers (ViT) apply the transformer architecture to image patches. Split image into patches, project to vectors, add CLS and position embeddings, run through encoder, project CLS embedding to class. With enough data, the model learns inductive biases CNNs had built in. Foundation for modern multimodal systems.
Mixture-of-Experts (MoE) keeps the transformer architecture but replaces the dense feed-forward layer with multiple experts and a gating network that routes each token to a subset. Scales total parameter count without proportionally scaling per-token compute. Used by many frontier-scale models, though vendor disclosure varies.
In one sentence each. ViT enables transformers to process non-text modalities. MoE enables scaling parameter counts without scaling per-token compute.
The transformer block is now a general-purpose neural-network primitive. Reading “transformer-based system for X” should not surprise you anywhere in modern AI research.

If you remember one thing

The transformer block was designed for translation. It turned out to work for almost everything.
ViT adapts it to non-text inputs. MoE adapts the block’s internal compute for sparse routing.
Most modern AI systems are some combination of these adaptations on the same underlying architecture.