Cheatsheet: Transformers beyond text, ViT and Mixture-of-Experts

The transformer block was designed for translation.
It turned out to work for almost everything else too.
ViT adapts it to non-text inputs.
MoE rewires its internal compute for sparse routing.
IMAGE (e.g., 224×224)
Split into fixed-size patches (e.g., 16×16) → 196 patches
Each patch → linear projection → 1 vector per patch
Prepend CLS token (learned vector) + add position embeddings
Run through transformer encoder (standard architecture)
Take CLS token's final embedding
Feed-forward network → class label

Same transformer block as Phase 2. What changes is input tokenization (patches instead of words) and output decoding (class via CLS).
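The tokenization steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`patchify`, `linear`, `vit_tokens`), the toy sizes, and the random weights are all hypothetical, and the image is single-channel to keep the code short (an RGB 16×16 patch would flatten to 16 × 16 × 3 = 768 values).

```python
import random

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches

def linear(vec, weight):
    """Project vec (length d_in) with weight (d_out rows of d_in values)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def vit_tokens(image, patch, weight, cls_token, pos_embed):
    """Patches -> linear projection -> prepend CLS -> add position embeddings."""
    tokens = [cls_token] + [linear(p, weight) for p in patchify(image, patch)]
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(tokens, pos_embed)]

# Toy sizes (assumptions): 8x8 grayscale image, 4x4 patches, 6-dim embeddings.
rng = random.Random(0)
HEIGHT, WIDTH, PATCH, D_MODEL = 8, 8, 4, 6
image = [[rng.random() for _ in range(WIDTH)] for _ in range(HEIGHT)]
weight = [[rng.gauss(0, 0.02) for _ in range(PATCH * PATCH)] for _ in range(D_MODEL)]
cls_token = [rng.gauss(0, 0.02) for _ in range(D_MODEL)]
num_patches = (HEIGHT // PATCH) * (WIDTH // PATCH)
pos_embed = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)]
             for _ in range(num_patches + 1)]  # +1 for the CLS position

tokens = vit_tokens(image, PATCH, weight, cls_token, pos_embed)
print(len(tokens), len(tokens[0]))  # 5 6  (4 patches + CLS, each 6-dim)
print((224 // 16) ** 2)             # 196  (patch count for the 224x224 example)
```

The resulting `tokens` list is what the standard encoder would consume; the encoder itself and the classification head are omitted here.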

Aspect | CNN | ViT
Inductive bias | High (translation equivariance and locality, built in) | Low (must be learned from data)
Small data | Wins | Loses
Large data | Loses | Wins
Modern frontier image work | Embedded/constrained settings | Default

Image → ViT-style encoder → image tokens
Text → tokenizer/embeddings → text tokens
Image tokens + text tokens → LLM decoder → response

Examples: LLaVA (CLIP ViT encoder + LLM). GPT-4V, Claude with vision, and Gemini accept images in a broadly similar way, though their architectures are not fully published.
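A rough sketch of how the two token streams can be merged into one sequence for the decoder. Several things here are assumptions rather than anything the cheatsheet specifies: the learned projection that maps image tokens to the LLM's embedding width (LLaVA-style systems use a connector of this kind), placing image tokens before text tokens, and the hypothetical names `project` and `build_multimodal_sequence`.

```python
def project(vec, weight):
    """Linear map: weight has d_out rows, each of length len(vec)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def build_multimodal_sequence(image_tokens, text_tokens, projector):
    """Map image tokens to the LLM embedding width, then prepend them to the text."""
    projected = [project(tok, projector) for tok in image_tokens]
    return projected + text_tokens

# Toy widths (assumptions): vision encoder emits 3-dim tokens, LLM expects 4-dim.
D_VISION, D_LLM = 3, 4
image_tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]            # from the ViT-style encoder
text_tokens = [[0.0] * D_LLM, [1.0] * D_LLM, [0.5] * D_LLM]  # from the text embedding table
projector = [[0.1] * D_VISION for _ in range(D_LLM)]         # stand-in for learned weights

sequence = build_multimodal_sequence(image_tokens, text_tokens, projector)
print(len(sequence), len(sequence[0]))  # 5 4
```

Once both modalities share one sequence of equal-width vectors, the decoder treats them like any other tokens.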

Standard transformer block:

TOKEN → attention → FFN (single dense network) → output

MoE transformer block:

TOKEN → attention → gating network decides: "send to experts 3 and 7"
      → activate only those 2 of N experts
      → weighted combine → output
      (the other N-2 experts stay idle for this token)

Routing is per-token, not per-input. Different tokens in the same prompt go to different experts.
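Per-token top-2 routing can be sketched as follows. This is a simplified illustration, not how any particular model implements it: the gate logits are hard-coded instead of coming from a learned gating network, the experts are toy scaling functions, and renormalizing the selected gate weights to sum to 1 is one common choice that is assumed here. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, gate_logits, experts, k=2):
    """Run only the selected experts and combine their outputs by gate weight."""
    output = [0.0] * len(token)
    for index, weight in route_top_k(gate_logits, k):
        expert_out = experts[index](token)  # the other experts are never called
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

def make_expert(scale):
    """Toy expert: multiplies its input by a constant."""
    return lambda vec: [scale * x for x in vec]

experts = [make_expert(i + 1) for i in range(8)]  # N = 8 experts

# Two tokens from the same prompt, with different (hard-coded) gate logits.
logits_a = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0]  # favours experts 3 and 7
logits_b = [3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # favours experts 0 and 1

print(sorted(i for i, _ in route_top_k(logits_a)))  # [3, 7]
print(sorted(i for i, _ in route_top_k(logits_b)))  # [0, 1]
out_a = moe_layer([1.0, 2.0], logits_a, experts)
```

The two tokens land on disjoint expert pairs, which is the per-token behaviour described above.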

Dense, 70B parameters:
active params per token = 70B
per-token compute = proportional to 70B
MoE, ~182B total params, 2-of-8 routing (each expert ~22B, shared ~6B):
total params = 8 × 22B + 6B = 182B
active params per token = 2 × 22B + 6B = 50B
per-token compute = proportional to 50B
2.6× more total parameters, lower per-token compute (50B vs 70B active).
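The arithmetic can be checked with a few lines of Python. This is a sketch: the function name `moe_param_counts` is hypothetical, and it follows the cheatsheet's simplification of treating each expert as one block of parameters (in real models the experts sit inside each MoE layer). With eight ~22B experts and ~6B of shared parameters, the total comes to 182B.

```python
def moe_param_counts(num_experts, top_k, expert_params, shared_params):
    """Return (total, active-per-token) parameter counts, in the units given."""
    total = num_experts * expert_params + shared_params
    active = top_k * expert_params + shared_params
    return total, active

DENSE_PARAMS_B = 70  # dense baseline, in billions

moe_total, moe_active = moe_param_counts(
    num_experts=8, top_k=2, expert_params=22, shared_params=6)

print(moe_total, moe_active)                     # 182 50
print(round(moe_total / DENSE_PARAMS_B, 2))      # 2.6   (total vs dense)
print(round(moe_active / DENSE_PARAMS_B, 2))     # 0.71  (per-token active vs dense)
```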

This is why many frontier-scale models use MoE: it scales capacity without a proportional increase in per-token compute.

"1-trillion-parameter model"
Ask: dense or MoE?
If dense: per-token active = 1T (very expensive)
If MoE: per-token active = a fraction of the total (often tens to a few hundred billion, depending on expert count and top-k)
Active parameters per token is the relevant comparison for cost and latency.

Adaptation | What it does
Diffusion transformers | Transformer self-attention inside denoising diffusion (modern image generation)
Speech transformers | Audio converted to log-mel spectrograms and embedded as a token sequence: ViT-style patches in Audio Spectrogram Transformer, a small convolutional stem in Whisper
Recommendation transformers | Self-attention over user behavior sequences
Diffusion-based LLMs | Text generation by iterative denoising (covered in the next lesson)

Pattern in each case: same transformer block, modality-specific input/output handling.

Choose dense when | Choose MoE when
Memory constrained (single GPU) | More memory available (multi-GPU or cloud)
Small batch sizes | Large batch sizes
Predictable latency matters most | Want capability scaling without latency scaling
Mature tooling required | OK with MoE-specific tooling
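
The memory row can be made concrete with a back-of-the-envelope estimate. The sketch below is an assumption-laden illustration: it counts weights only (no KV cache or activations), assumes 2 bytes per parameter (fp16/bf16), and uses a hypothetical helper `weight_memory_gb`. It compares the 70B dense model with the 2-of-8 MoE example (eight 22B experts plus 6B shared, 182B in total).

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate weight memory in GB: 1e9 params x bytes, divided by 1e9 bytes per GB."""
    return params_billion * bytes_per_param

dense_gb = weight_memory_gb(70)        # dense: all 70B parameters resident
moe_gb = weight_memory_gb(182)         # MoE: every expert must be resident
moe_active_gb = weight_memory_gb(50)   # weights actually touched per token

print(dense_gb, moe_gb, moe_active_gb)  # 140 364 100
```

Under these assumptions the MoE model touches fewer weights per token than the dense one but needs roughly 2.6× the memory to hold them, which is why the dense choice suits memory-constrained deployments.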

Pitfall | Reality
"Comparing dense and MoE parameter counts directly." | Per-token active parameters matter; MoE total is misleading for cost.
"ViT replaced CNNs everywhere." | No. CNNs are still competitive on small data and resource-constrained tasks.
"Transformer-based system for X = architectural innovation." | Usually the novelty is in tokenization/decoding, not the block.
"MoE = automatically faster." | MoE needs less compute per token than a dense model of the same total size, but it adds training-stability and memory costs.
  • Vision Transformer (ViT): transformer architecture applied to image patches. Split image into patches, project to vectors, run through encoder, classify via CLS.
  • CLS token: learned vector prepended to the input sequence, used to produce a single embedding for the whole input. Convention from BERT (Phase 2).
  • Mixture-of-Experts (MoE): transformer variant where the FFN is replaced by N experts plus a gating network that routes each token to a subset (typically 2 of N).
  • Gating network: small neural network that decides which experts to send each token to. Trained jointly with the experts.
  • Dense model: all parameters are active for every token. Standard transformer.
  • Sparse model: only a subset of parameters are active for any given token. MoE is the dominant sparse-architecture pattern.
  • Total parameters: every parameter in the model, active or not. Sets the size on disk and the memory needed to load it.
  • Active parameters per token: the parameters used during a forward pass on a single token. The relevant cost number.
  • Diffusion transformer: transformer architecture inside a denoising diffusion process for image generation.

The transformer block was designed for translation. It turned out to work for almost everything.
ViT adapts it to non-text inputs. MoE adapts the block’s internal compute for sparse routing.
Most modern AI systems are some combination of these adaptations on the same underlying architecture.