Summary: Transformers beyond text, ViT and Mixture-of-Experts
The transformer block was designed for translation. It turned out to work for almost everything else too. With minor adaptations, the same architecture has been applied to images (Vision Transformers), audio, video, recommendations, and image generation. The block itself is doing the work; what changes is mostly how inputs get tokenized and how outputs get decoded.
Vision Transformers (ViT) split images into fixed-size patches, project each patch to a vector, prepend a CLS token, add position embeddings, and run the result through a transformer encoder. The CLS token’s final embedding is projected to class logits, and the highest-scoring class is the predicted label. With enough training data, ViT learns the image inductive biases that CNNs have built into their architecture by design. ViT-style encoders are now the usual architectural foundation for modern multimodal systems (vision-language models, audio-text models, etc.).
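The input side of that pipeline can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `vit_tokens`, the embedding width `dim=64`, and the randomly initialised arrays standing in for learned parameters are all assumptions made for the example. The encoder and classification head are left out.

```python
import numpy as np

def vit_tokens(image, patch=16, dim=64, seed=0):
    """Turn an (H, W, C) image into a (1 + num_patches, dim) token sequence."""
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    gh, gw = h // patch, w // patch

    # Cut the image into non-overlapping patches and flatten each one.
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * c)

    # Random stand-ins for parameters that would be learned in a real model.
    w_proj = rng.normal(0.0, 0.02, size=(patch * patch * c, dim))
    cls = rng.normal(0.0, 0.02, size=(1, dim))
    pos = rng.normal(0.0, 0.02, size=(1 + gh * gw, dim))

    tokens = patches @ w_proj                        # linear patch projection
    tokens = np.concatenate([cls, tokens], axis=0)   # prepend the CLS token
    return tokens + pos                              # add position embeddings

image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = vit_tokens(image)
print(tokens.shape)  # (197, 64): 14 x 14 patches plus one CLS token
```

From here on, the sequence is handled exactly like a sequence of word embeddings; the encoder does not need to know the tokens came from pixels.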
Mixture-of-Experts (MoE) keeps the standard transformer architecture but replaces the dense feed-forward layer with multiple “experts” (each its own FFN) and a small gating network that routes each token to a subset of experts (commonly the top 2 of N, though the number varies by model). Expert parameter count scales linearly with N; per-token compute stays roughly constant because only the selected experts run. Frontier-scale open-source models like Mixtral and DeepSeek-V3 are MoE; many closed-source frontier models reportedly are too.
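A minimal sketch of top-k routing, assuming a two-matrix ReLU FFN per expert and gate weights obtained by a softmax over only the selected experts' logits (one common choice; real models differ in the details). The names `moe_layer`, `w_gate`, and `experts` are hypothetical, and load balancing is ignored.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, w_gate, experts, k=2):
    """x: (tokens, dim); w_gate: (dim, n_experts); experts: list of (w1, w2)."""
    logits = x @ w_gate                                  # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]            # top-k expert ids per token
    top_logits = np.take_along_axis(logits, top, axis=-1)
    weights = softmax(top_logits, axis=-1)               # renormalise over chosen experts

    out = np.zeros_like(x)
    for e, (w1, w2) in enumerate(experts):
        tok, slot = np.nonzero(top == e)                 # tokens routed to expert e
        if tok.size == 0:
            continue                                     # this expert stays idle
        h = np.maximum(x[tok] @ w1, 0.0)                 # expert FFN with ReLU
        out[tok] += weights[tok, slot][:, None] * (h @ w2)
    return out, top

rng = np.random.default_rng(0)
dim, hidden, n_experts, n_tokens = 8, 16, 4, 5
x = rng.normal(size=(n_tokens, dim))
w_gate = rng.normal(size=(dim, n_experts))
experts = [(rng.normal(size=(dim, hidden)), rng.normal(size=(hidden, dim)))
           for _ in range(n_experts)]

out, chosen = moe_layer(x, w_gate, experts, k=2)
print(out.shape, chosen.shape)  # (5, 8) (5, 2)
```

Each token runs through only `k` expert FFNs regardless of how many experts exist, which is why adding experts grows the parameter count but not the per-token compute.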
In one sentence each: ViT lets transformers process images and, by the same tokenization idea, other non-text modalities. MoE lets parameter counts scale without scaling per-token compute.
This summary is the scan-it-in-five-minutes version. The full lesson covers the end-to-end ViT pipeline, the per-token routing mechanism for MoE, and other transformer adaptations (diffusion, speech, recommendation, diffusion-based LLMs) that exist beyond these two.
Core ideas
- Transformer is a general-purpose neural-network primitive. Originally designed for machine translation; now reused across modalities and architectures.
- ViT pipeline. Split image into fixed-size patches (typically 16×16 pixels). Project each patch to a vector (linear layer). Prepend learned CLS token and add position embeddings. Run through encoder. Project CLS final embedding to class.
- Why ViT works. With enough training data, the model learns inductive biases (translation equivariance, composition of local features) that CNNs have built into their architecture. On small data, CNNs tend to win; on large data, ViT tends to win.
- ViT as multimodal foundation. Modern multimodal systems (LLaVA, GPT-4V, Claude with vision, Gemini) typically have a ViT-style image encoder feeding into a text-decoder LLM.
- MoE mechanism. In each FFN layer, gating network routes each token to ~2 of N experts. Only those experts are activated; rest sit idle. Total parameters scale with N; per-token compute stays roughly constant.
- MoE routing is per-token. Different tokens within the same prompt go to different experts. Lets you place experts on different GPUs and parallelize.
- MoE in production. Mixtral, DeepSeek-V3, GPT-OSS, GLM 4 (open). Many closed-source frontier models reportedly MoE.
- Reading parameter counts. When you see “1 trillion parameters,” ask whether dense or MoE. Active-parameters-per-token is the relevant comparison; for MoE that number is much smaller than total.
- Other adaptations exist. Diffusion transformers (image generation), speech transformers (Whisper-style audio), recommendation transformers, diffusion-based LLMs (next lesson). Same block, different input/output handling.
- Pitfall: comparing dense and MoE parameter counts directly. Per-token compute differs by the activation factor; not apples-to-apples on speed.
- Pitfall: assuming ViT replaced CNNs. It did not. CNNs are still competitive on small datasets and resource-constrained tasks; ViT dominates frontier-scale image and multimodal work.
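The difference between total and active parameters can be made concrete with a back-of-the-envelope count. This is a rough sketch under stated assumptions: the function names are hypothetical, every expert is a two-matrix FFN with biases ignored, `shared_per_layer` lumps together everything that runs for every token (attention and so on), and the layer sizes in the usage lines are invented for illustration rather than taken from any real model.

```python
def ffn_params(dim, hidden):
    # Two weight matrices (dim -> hidden -> dim); biases ignored.
    return 2 * dim * hidden

def moe_param_counts(n_layers, dim, hidden, n_experts, k, shared_per_layer):
    """Return (total, active-per-token) parameter counts for a simplified MoE stack."""
    per_expert = ffn_params(dim, hidden)
    router = dim * n_experts  # gating network weights
    total = n_layers * (shared_per_layer + router + n_experts * per_expert)
    active = n_layers * (shared_per_layer + router + k * per_expert)
    return total, active

# Hypothetical configuration, for illustration only.
total, active = moe_param_counts(
    n_layers=24, dim=1024, hidden=4096, n_experts=8, k=2,
    shared_per_layer=4 * 1024 * 1024,
)
print(f"total={total:,} active={active:,} ratio={total / active:.1f}x")
```

Setting `k` equal to `n_experts` makes the two counts identical, which is the dense case. The gap between them is what a headline parameter count hides.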
What changes for you
After this lesson, “transformer-based system for X” announcements stop being mysterious. The underlying block is the same one Phase 2 covered. The novelty is usually in input tokenization, output decoding, or training data, not the block itself. Reading “1-trillion-parameter model” announcements becomes more careful: dense or MoE? Active-parameters-per-token tells you the actual cost profile.
The transformer block was designed for translation. It turned out to work for almost everything.
ViT adapts it to non-text inputs. MoE adapts the block’s internal compute for sparse routing.
Most modern AI systems are some combination of these adaptations on the same underlying architecture.