References: Transformers beyond text, ViT and Mixture-of-Experts

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 9, Current Trends):
see course site at https://cme295.stanford.edu/ for the lecture URL
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the Vision Transformer and Mixture-of-Experts sections
of Stanford CME 295 Lecture 9: the ViT pipeline [00:55:14-01:01:00], the
MoE recap [00:12:00-00:13:30], and the broader framing of "transformers
beyond text" applications. The cheatsheet covers MoE but not ViT, so the
ViT material here comes primarily from the lecture. Clawdemy provides original
notes, summaries, and quizzes derived from this material for educational
purposes. All rights to the original lectures remain with Stanford and
the instructors.

The three papers behind this lesson.

  • “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, Dosovitskiy et al., 2020/2021. The Vision Transformer (ViT) paper. Section 3 (the patching + linear-projection pipeline) is the load-bearing technical content; Section 4 (the comparison with CNNs across dataset sizes) supplies the empirical headline: “ViT wins on large data, CNNs win on small.” Worth reading after this lesson; the technique is more straightforward than its impact suggests.

  • “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”, Shazeer et al., 2017. The original sparse-MoE paper from Google. Pre-dates the LLM era by several years; introduced the gating-network-routes-tokens-to-experts pattern that modern MoE LLMs still use. Sections 2 (the gating mechanism) and 3 (the load-balancing and training tricks) are the technical core.

  • “Mixtral of Experts”, Jiang et al., 2024. The Mixtral paper. Useful as a more recent, LLM-specific reference for how MoE works in practice in a frontier-scale open model. Section 2 walks through the architecture; Section 3 covers training and inference details.
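
The patching + linear-projection step from the ViT paper and the gating-network-routes-tokens-to-experts pattern from the MoE papers can be sketched in a few lines. This is a minimal illustration in plain Python, not code from any of the papers or from the lecture. The function names (`patchify`, `linear`, `top_k_gate`), the tiny 8x8 image, and the random weights are all assumptions made for the example. It leaves out ViT's class token and position embeddings, and the noise term and load-balancing losses used when training real MoE layers.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)  # append every channel value of this pixel
            patches.append(flat)
    return patches  # (H/patch) * (W/patch) vectors, each of length patch*patch*C


def linear(vector, weights):
    """Multiply a length-d_in vector by a d_in x d_out weight matrix (no bias)."""
    return [sum(v * w for v, w in zip(vector, column)) for column in zip(*weights)]


def top_k_gate(token, gate_weights, k):
    """Score all experts, keep the k best, and softmax over only those k scores."""
    logits = linear(token, gate_weights)
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}  # expert index -> mixing weight


# Hypothetical toy sizes, chosen only to keep the example small.
rng = random.Random(0)
patch, channels, d_model, n_experts = 4, 3, 8, 4
image = [[[rng.random() for _ in range(channels)] for _ in range(8)] for _ in range(8)]
w_embed = [[rng.gauss(0, 0.02) for _ in range(d_model)]
           for _ in range(patch * patch * channels)]
w_gate = [[rng.gauss(0, 1) for _ in range(n_experts)] for _ in range(d_model)]

# ViT-style: each flattened patch becomes one token embedding.
tokens = [linear(p, w_embed) for p in patchify(image, patch)]

# MoE-style: each token is routed to its top-2 experts with normalized weights.
routing = [top_k_gate(t, w_gate, k=2) for t in tokens]
print(len(tokens), len(tokens[0]))
```

With an 8x8 image and 4x4 patches, `tokens` holds four embeddings of length `d_model`, and each entry of `routing` maps two expert indices to weights that sum to one. In a full MoE layer, those weights would scale the outputs of the selected expert feed-forward networks before they are summed.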

A short list, chosen for durability.

  • “Scalable Diffusion Models with Transformers” (DiT), Peebles & Xie, 2023. The DiT paper. Useful for understanding the diffusion-transformer adaptation mentioned briefly in this lesson; DiT-style backbones underpin many recent image-generation models.

  • “Robust Speech Recognition via Large-Scale Weak Supervision” (Whisper), Radford et al., 2022. The Whisper paper. The canonical speech-transformer example; useful for seeing how audio is converted into log-Mel spectrogram frames and fed through a transformer encoder-decoder.

  • “A Survey on Mixture of Experts”, Cai et al., 2024. Recent survey of MoE methods, including the LLM-specific variants. Useful as a one-stop overview when this lesson’s MoE coverage feels insufficient.

  • “Visual Instruction Tuning” (LLaVA), Liu et al., 2023. The LLaVA paper, which the lecturer cited as a popular example of how ViT-style image encoders combine with text-decoder LLMs. Section 3 walks through the architecture.

  • Multimodal foundation models. The architectural pattern (modality-specific encoders + shared LLM core) is the foundation for most modern frontier multimodal systems. Search terms: “vision-language models,” “multimodal LLM architecture,” “modality fusion.” Useful for understanding the broader ecosystem this lesson’s ViT and MoE fit into.

  • Sparse vs dense scaling laws. MoE and dense models follow different scaling dynamics. Search terms: “MoE scaling laws,” “Chinchilla for sparse models,” “fine-grained MoE.” Active research area; the boundary at which sparse architectures outperform dense ones is still shifting.

  • Hardware implications of MoE. Training and serving MoE models requires different infrastructure than dense models. Search terms: “expert parallelism,” “MoE serving,” “sparse model inference.” Mostly in vendor blogs; durable academic references are still consolidating.

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Section 3 (optimizations) covers Mixture-of-Experts; Vision Transformers are not separately covered in the cheatsheet (the lecture is the primary source for ViT material). Worth using as a study reference for the MoE half of this lesson.

None selected for this lesson. Vendor blog posts (Anthropic, OpenAI, Google DeepMind, Meta AI, DeepSeek) are the better entry points for current production-MoE thinking; durable academic-grade community references are still consolidating. The diffusion-transformer ecosystem is producing strong technical writing in the image-generation research community.