References: Transformers beyond text, ViT and Mixture-of-Experts
Source material
Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 9, Current Trends): see course site at https://cme295.stanford.edu/ for the lecture URL
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT

This lesson adapts the Vision Transformer and Mixture-of-Experts sections of Stanford CME 295 Lecture 9, covering [00:55:14-01:01:00] ViT pipeline, [00:12:00-00:13:30] MoE recap, plus the broader framing of "transformers beyond text" applications. The cheatsheet covers MoE but not ViT; this lesson carries ViT primarily from the lecture. Clawdemy provides original notes, summaries, and quizzes derived from this material for educational purposes. All rights to the original lectures remain with Stanford and the instructors.

Foundational papers
The two papers behind this lesson, plus a more recent LLM-specific MoE reference.
- “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, Dosovitskiy et al., 2020/2021. The Vision Transformer (ViT) paper. Section 3 (the patching + linear-projection pipeline) is the load-bearing technical content; Section 4 (the empirical comparison with CNNs across dataset sizes) is the empirical headline that “ViT wins on large data, CNNs win on small.” Worth reading after this lesson; the technique is more straightforward than the impact suggests.
- “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”, Shazeer et al., 2017. The original sparse-MoE paper from Google. Pre-dates the LLM era by several years; introduced the gating-network-routes-tokens-to-experts pattern that modern MoE LLMs still use. Sections 2 (the gating mechanism), 3 (performance challenges such as batch size and network bandwidth), and 4 (balancing expert utilization) are the technical core.
- “Mixtral of Experts”, Jiang et al., 2024. The Mixtral paper. Useful as a more recent and LLM-specific reference for how MoE works in practice in a frontier-scale open model. Section 2 walks the architecture; Section 3 reports benchmark results, and Section 5 analyzes expert routing.
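The patching + linear-projection pipeline named in the ViT entry above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names (`patchify`, `linear_project`, `embed_image`), the toy sizes, and the random weights are all assumptions made for the example, and a real ViT learns the projection, class token, and position embeddings during training.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Patches are taken in row-major order; each holds patch * patch * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


def linear_project(vectors, weight):
    """Multiply each input vector by a (d_in x d_model) weight matrix (no bias)."""
    d_model = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_model)]
        for v in vectors
    ]


def embed_image(image, patch, weight, cls_token, pos_embed):
    """Patchify, project, prepend a class token, then add position embeddings."""
    tokens = [list(cls_token)] + linear_project(patchify(image, patch), weight)
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]


# Toy example: 8x8 RGB image, 4x4 patches, model width 6 (all sizes hypothetical).
rng = random.Random(0)
H, W, C = 8, 8, 3
P, D = 4, 6
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
num_patches = (H // P) * (W // P)

# Random stand-ins for parameters that a real model would learn.
weight = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
cls_token = [0.0] * D
pos_embed = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(num_patches + 1)]

tokens = embed_image(image, P, weight, cls_token, pos_embed)
print(len(tokens), len(tokens[0]))  # 5 6
```

The output is a sequence of 5 tokens (4 patches plus the class token), each of width 6, which is the shape a standard transformer encoder expects. From this point on, the image is handled like any other token sequence.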
Going deeper
A short list, chosen for durability.
- “Scalable Diffusion Models with Transformers”, Peebles & Xie, 2023. The Diffusion Transformer (DiT) paper. Useful for understanding the diffusion-transformer adaptation mentioned briefly in this lesson; the DiT design underlies many recent image-generation models.
- “Robust Speech Recognition via Large-Scale Weak Supervision” (Whisper), Radford et al., 2022. The Whisper paper. The canonical speech-transformer example; useful for seeing how audio is converted to log-Mel spectrogram frames and fed through an encoder-decoder transformer.
- “A Survey on Mixture of Experts”, Cai et al., 2024. Recent survey of MoE methods, including the LLM-specific variants. Useful as a one-stop overview when this lesson’s MoE coverage feels insufficient.
- “Visual Instruction Tuning” (LLaVA), Liu et al., 2023. The LLaVA paper, which the lecturer cited as a popular example of how ViT-style image encoders combine with text-decoder LLMs. Section 4.1 walks the architecture.
Adjacent topics
- Multimodal foundation models. The architectural pattern (modality-specific encoders + shared LLM core) is the foundation for most modern frontier multimodal systems. Search terms: “vision-language models,” “multimodal LLM architecture,” “modality fusion.” Useful for understanding the broader ecosystem this lesson’s ViT and MoE fit into.
- Sparse vs dense scaling laws. MoE and dense models follow different scaling dynamics. Search terms: “MoE scaling laws,” “Chinchilla for sparse models,” “fine-grained MoE.” Active research area; the boundary at which sparse architectures outperform dense ones is still shifting.
- Hardware implications of MoE. Training and serving MoE models requires different infrastructure than dense models. Search terms: “expert parallelism,” “MoE serving,” “sparse model inference.” Coverage is mostly in vendor blogs; durable academic references are still consolidating.
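The sparse-versus-dense distinction in the topics above comes from top-k routing: a gating network scores every expert, but only the k best-scoring experts run for a given token. The sketch below illustrates that idea in plain Python under stated assumptions. The names (`top_k_route`, `moe_layer`, `make_expert`), the toy sizes, and the scaling "experts" are hypothetical stand-ins for real feed-forward experts, and the noise terms and load-balancing losses used in practice are omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(token, gate_weight, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Softmax is taken over the k selected logits only, so the weights sum to 1.
    """
    num_experts = len(gate_weight[0])
    logits = [
        sum(token[i] * gate_weight[i][e] for i in range(len(token)))
        for e in range(num_experts)
    ]
    chosen = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)[:k]
    weights = softmax([logits[e] for e in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, gate_weight, experts, k):
    """Weighted sum of the k selected experts' outputs; other experts never run."""
    out = [0.0] * len(token)
    for idx, w in top_k_route(token, gate_weight, k):
        expert_out = experts[idx](token)
        out = [o + w * y for o, y in zip(out, expert_out)]
    return out


# Toy setup: width 4, 8 experts, 2 active per token (all sizes hypothetical).
rng = random.Random(0)
D, E, K = 4, 8, 2
gate_weight = [[rng.gauss(0, 1) for _ in range(E)] for _ in range(D)]
calls = []  # records which experts were actually executed


def make_expert(index):
    """Toy expert that scales its input; a real expert is a feed-forward network."""
    def expert(x):
        calls.append(index)
        return [(index + 1) * v for v in x]
    return expert


experts = [make_expert(i) for i in range(E)]
token = [rng.gauss(0, 1) for _ in range(D)]
output = moe_layer(token, gate_weight, experts, K)
print(len(calls), "of", E, "experts ran")  # 2 of 8 experts ran
```

All 8 experts contribute parameters that must be stored, but only 2 are computed per token. That gap between total and active parameters is what drives both the separate scaling behavior and the serving concerns (such as expert parallelism) listed above.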
Stanford CME 295 cheatsheet
- Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Section 3 (optimizations) covers Mixture-of-Experts; Vision Transformers are not separately covered in the cheatsheet (the lecture is the primary source for ViT material). Worth using as a study reference for the MoE half of this lesson.
Community discussion
None selected for this lesson. Vendor blog posts (Anthropic, OpenAI, Google DeepMind, Meta AI, DeepSeek) are the better entry points for current production-MoE thinking; durable academic-grade community references are still consolidating. The diffusion-transformer ecosystem is producing strong technical writing in the image-generation research community.