References: From language models to large multimodal models
Source material
- Stanford CS25 V4 (May 9, 2024): "From Large Language Models to Large Multimodal Models"
  Speaker: Ming Ding (Zhipu AI)
  YouTube: https://www.youtube.com/watch?v=cYfKQ6YG9Qo
  Course site: https://web.stanford.edu/class/cs25/past/cs25-v4/
  License (lecture video): as published on Stanford's public CS25 YouTube channel (link-out only)
Clawdemy provides original notes, summaries, and quizzes derived from this material for educational purposes. All rights to the original lecture remain with Stanford and the speaker.
What this lesson draws from each source
- Ming Ding’s CS25 V4 lecture anchors the case study: CogVLM’s architecture (the visual expert), the two-stage training recipe, and the CogAgent extension to GUI tasks. The lecture is the canonical public walkthrough of the design and the design rationale.
- The framing of encode-then-fuse as a family with first-wave (LLaVA-style) and refined (CogVLM-style) members, the explicit freezing-versus-fine-tuning tradeoff table, and the connection to the “what you should remember” pitfalls are Clawdemy’s own connective tissue.
Going deeper
- “CogVLM: Visual Expert for Pretrained Language Models” (Wang et al., 2024). The CogVLM paper itself. Section 2 (Method) is the technical core: the visual expert module and the design rationale, in exactly the level of detail this lesson summarizes.
- “Visual Instruction Tuning” (Liu et al., LLaVA, 2023). The LLaVA paper, the first-wave projector-only counterpoint to CogVLM. Useful for seeing the simpler end of the encode-then-fuse family.
- Stanford CS25 V4 course page. The full V4 schedule for readers who want to see what else the edition covered around the multimodal lecture.
Adjacent topics
- Vision Transformers (ViT) and CLIP. The pretrained vision encoder that almost every encode-then-fuse LMM uses, and the contrastive image-text pretraining that makes its outputs roughly language-aligned. Foundational to the entire family; not covered as its own lesson in this track’s CS25 scope.
- Native multimodal architectures (the next lesson). The alternative to encode-then-fuse: train one model on mixed modalities from the start. Targets exactly the ceiling identified at the end of this lesson.
- Multimodal agents (lesson 9). CogAgent’s GUI extension is the early bridge to this territory; the lesson on multimodal agents in production picks up the thread.
Community discussion
None selected for this lesson. The CogVLM paper and the CS25 lecture together are the strongest public account of this architecture; secondary discussion does not yet add durable value over them. If a canonical thread surfaces, it will be added at the next review.