Cheatsheet: Transformers beyond text, ViT and Mixture-of-Experts
The one idea that matters
The transformer block was designed for translation. It turned out to work for almost everything else too.
ViT adapts it to non-text inputs. MoE rewires its internal compute for sparse routing.

The ViT pipeline
    IMAGE (e.g., 224×224)
      ↓
    Split into fixed-size patches (e.g., 16×16) → 196 patches
      ↓
    Each patch → linear projection → 1 vector per patch
      ↓
    Prepend CLS token (learned vector) + add position embeddings
      ↓
    Run through transformer encoder (standard architecture)
      ↓
    Take CLS token's final embedding
      ↓
    Feed-forward network → class label

Same transformer block as Phase 2. What changes is input tokenization (patches instead of words) and output decoding (class via CLS).
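The input side of the pipeline above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`patchify`, `linear_project`, `vit_input_sequence`), the tiny embedding width `D = 4`, and the random weights are all assumptions made for the example. A real ViT learns the projection, CLS token, and position embeddings during training.

```python
import random

def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append this pixel's channel values
            patches.append(flat)
    return patches

def linear_project(vector, weights):
    """Multiply a length-n vector by an n x d weight matrix (list of n rows)."""
    d = len(weights[0])
    out = [0.0] * d
    for value, row in zip(vector, weights):
        for j in range(d):
            out[j] += value * row[j]
    return out

def vit_input_sequence(image, patch_size, weights, cls_token, pos_embeddings):
    """Patches -> projected vectors -> prepend CLS -> add position embeddings."""
    tokens = [linear_project(p, weights) for p in patchify(image, patch_size)]
    tokens = [cls_token] + tokens
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(tokens, pos_embeddings)]

# Hypothetical sizes: 224x224 RGB image, 16x16 patches, embedding width 4.
rng = random.Random(0)
H = W = 224
P, C, D = 16, 3, 4
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weights = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
cls_token = [0.0] * D
num_patches = (H // P) * (W // P)
pos_embeddings = [[rng.gauss(0, 0.02) for _ in range(D)]
                  for _ in range(num_patches + 1)]

sequence = vit_input_sequence(image, P, weights, cls_token, pos_embeddings)
print(num_patches, len(sequence), len(sequence[0]))  # 196 197 4
```

The sequence has 197 entries: 196 patch tokens plus the CLS token. From here on, the encoder sees the same kind of input it would see for text.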
ViT vs CNN: the trade-off
| | CNN | ViT |
|---|---|---|
| Inductive bias | High (translation equivariance, locality, built-in) | Low (must be learned from data) |
| Small data | Wins | Loses |
| Large data | Loses | Wins |
| Typical role today | Embedded and resource-constrained deployments | Default for frontier image work |
ViT as multimodal foundation
    Image → ViT-style encoder → image tokens
    Text → tokenizer/embeddings → text tokens
      ↓
    LLM decoder
      ↓
    response

Examples: LLaVA pairs a CLIP ViT encoder with an LLM. GPT-4V, Claude with vision, and Gemini are commonly described as following the same encoder + LLM pattern, though their architectures are not fully public.
The MoE mechanism
Standard transformer block:

    TOKEN → attention → FFN (single dense network) → output

MoE transformer block:

    TOKEN → attention → gating network decides: "send to experts 3 and 7"
                          ↓
                        activate only those 2 of N experts
                          ↓
                        weighted combine → output
                          ↑
                        other (N-2) experts idle for this token

Routing is per-token, not per-input. Different tokens in the same prompt go to different experts.
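The routing step can be sketched as top-2 selection over gate scores, followed by a weighted sum of the chosen experts' outputs. This is a toy version under stated assumptions: the gate logits are passed in directly (in a real model they come from a learned layer applied to the token), the experts are simple scaling functions, and the names `top_k_route` and `moe_layer` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts; renormalise their weights to sum to 1."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, gate_logits, experts, k=2):
    """Run only the routed experts and combine their outputs by gate weight."""
    routes = top_k_route(gate_logits, k)
    output = [0.0] * len(token)
    for index, weight in routes:
        expert_out = experts[index](token)  # the other experts are never called
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, routes

# Eight toy experts: expert i multiplies the token by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]

# Hypothetical gate output for one token; indices 3 and 7 score highest.
gate_logits = [0.1, 0.0, 0.2, 3.0, 0.1, 0.0, 0.3, 2.0]
output, routes = moe_layer([1.0, 2.0], gate_logits, experts)
print([index for index, _ in routes])  # [3, 7]
```

A second token with different gate logits would be sent to a different pair of experts, which is what "per-token, not per-input" means in practice.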
Dense vs MoE compute math
    Dense, 70B parameters:
      active params per token = 70B
      per-token compute = proportional to 70B

    MoE, ~182B total params, 2-of-8 routing (each expert ~22B, shared ~6B):
      total params = 8 × 22B + 6B = 182B
      active params per token = 2 × 22B + 6B = 50B
      per-token compute = proportional to 50B

About 2.6× more total parameters at a lower per-token cost (50B active vs 70B).
This is why many frontier-scale models use MoE: scale capability without scaling per-token compute, and so latency, to match.
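The arithmetic is easy to check in a few lines. The sketch below uses the per-expert and shared sizes quoted above (about 22B and 6B, in billions); with 8 experts those sizes sum to roughly 182B total. The function name `moe_param_counts` is hypothetical, and the model is simplified to "experts plus one shared block".

```python
def moe_param_counts(num_experts, experts_per_token, expert_params, shared_params):
    """Return (total, active-per-token) parameter counts, in the units passed in."""
    total = num_experts * expert_params + shared_params
    active = experts_per_token * expert_params + shared_params
    return total, active

DENSE_PARAMS = 70  # billions, the dense baseline from the comparison above

# 2-of-8 routing, ~22B per expert, ~6B shared (all in billions).
total, active = moe_param_counts(8, 2, 22, 6)
print(total, active)  # 182 50
print(round(total / DENSE_PARAMS, 2), round(active / DENSE_PARAMS, 2))  # 2.6 0.71
```

Changing `experts_per_token` shows how quickly the active count moves while the total stays fixed. That is why the routing configuration matters as much as the headline parameter count.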
Reading parameter-count claims
    "1-trillion-parameter model"
      ↓
    Ask: dense or MoE?
      ↓
    If dense: per-token active = 1T (very expensive)
    If MoE: per-token active = much less (often a small fraction of the total; check the reported active count)
      ↓
    Active-parameters-per-token is the relevant comparison for cost and latency.

Other adaptations to know exist
| Adaptation | What it does |
|---|---|
| Diffusion transformers | Transformer self-attention inside denoising diffusion (modern image generation) |
| Speech transformers | Audio log-mel spectrograms turned into a token sequence (Whisper uses a small convolutional stem; Audio Spectrogram Transformer uses ViT-style patches) |
| Recommendation transformers | Self-attention over user behavior sequences |
| Diffusion-based LLMs | Text generation by denoising; covered in the next lesson |
Pattern in each case: same transformer block, modality-specific input/output handling.
When to use dense vs MoE
| Choose dense when | Choose MoE when |
|---|---|
| Memory constrained (single GPU) | More memory available (multi-GPU or cloud) |
| Small batch sizes | Large batch sizes |
| Need tightly predictable latency | Want capability scaling without latency scaling |
| Mature tooling required | OK with MoE-specific tooling |
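The memory row is the one that most often surprises people: weight memory follows total parameters, while per-token compute follows active parameters. The sketch below makes that concrete. The MoE sizes are illustrative values, not measurements of any real model, and the 2 bytes per parameter assumes 16-bit weights. KV cache and activations are ignored.

```python
def weight_memory_gb(total_params_billions, bytes_per_param=2):
    """Memory needed just to hold the weights, in GB.

    1e9 params times bytes_per_param, divided by 1e9 bytes per GB,
    simplifies to params-in-billions times bytes_per_param.
    """
    return total_params_billions * bytes_per_param

# Illustrative models (parameter counts in billions).
models = {
    "dense": {"total": 70, "active": 70},
    "moe": {"total": 180, "active": 50},  # hypothetical MoE
}

for name, sizes in models.items():
    memory = weight_memory_gb(sizes["total"])
    print(name, memory, sizes["active"])
# dense 140 70
# moe 360 50
```

In this example the MoE does less work per token but needs more than twice the weight memory, which is why it tends to suit multi-GPU or cloud deployments.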
Pitfalls to dodge
| Pitfall | Reality |
|---|---|
| “Comparing dense and MoE parameter counts directly.” | Per-token active parameters drive cost; the MoE total is misleading for cost. |
| “ViT replaced CNNs everywhere.” | No. CNNs are still competitive on small data and resource-constrained tasks. |
| “Transformer-based system for X = architectural innovation.” | Usually the novelty is in tokenization/decoding, not the block. |
| “MoE = automatically faster.” | MoE needs less compute per token, but it has training-stability and memory costs. |
Glossary
- Vision Transformer (ViT): transformer architecture applied to image patches. Split image into patches, project to vectors, run through encoder, classify via CLS.
- CLS token: learned vector prepended to the input sequence, used to produce a single embedding for the whole input. Convention from BERT (Phase 2).
- Mixture-of-Experts (MoE): transformer variant where the FFN is replaced by N experts plus a gating network that routes each token to a small subset (for example, 2 of N).
- Gating network: small neural network that decides which experts to send each token to. Trained jointly with the experts.
- Dense model: all parameters are active for every token. Standard transformer.
- Sparse model: only a subset of parameters are active for any given token. MoE is the dominant sparse-architecture pattern.
- Total parameters: every parameter stored in the model, active or not. This count determines the model's size on disk and in memory.
- Active parameters per token: the parameters used during a forward pass on a single token. The relevant cost number.
- Diffusion transformer: transformer architecture inside a denoising diffusion process for image generation.
The transformer block was designed for translation. It turned out to work for almost everything.
ViT adapts it to non-text inputs. MoE adapts the block’s internal compute for sparse routing.
Most modern AI systems are some combination of these adaptations on the same underlying architecture.