References: From language models to large multimodal models

Source material:
• Stanford CS25 V4 (May 9, 2024):
"From Large Language Models to Large Multimodal Models"
Speaker: Ming Ding (Zhipu AI)
YouTube: https://www.youtube.com/watch?v=cYfKQ6YG9Qo
Course site: https://web.stanford.edu/class/cs25/past/cs25-v4/
License (lecture video): as published on Stanford's public CS25 YouTube channel
(link-out only)
Clawdemy provides original notes, summaries, and quizzes derived from this material
for educational purposes. All rights to the original lecture remain with Stanford
and the speaker.
  • Ming Ding’s CS25 V4 lecture anchors the case study: CogVLM’s architecture (the visual expert), the two-stage training recipe, and the CogAgent extension to GUI tasks. The lecture is the canonical public walkthrough of the design and its rationale.
  • The framing of encode-then-fuse as a family with first-wave (LLaVA-style) and refined (CogVLM-style) members, the explicit freezing-versus-fine-tuning tradeoff table, and the connection to the “what you should remember” pitfalls are Clawdemy’s own connective tissue.
  • Vision Transformers (ViT) and CLIP. The pretrained vision encoder that almost every encode-then-fuse LMM uses, and the contrastive image-text pretraining that makes its outputs roughly language-aligned. Foundational to the entire family, but outside this track’s CS25 scope, so it does not get its own lesson.
  • Native multimodal architectures (the next lesson). The alternative to encode-then-fuse: train one model on mixed modalities from the start. Targets exactly the ceiling identified at the end of this lesson.
  • Multimodal agents (lesson 9). CogAgent’s GUI extension is the early bridge to this territory; the lesson on multimodal agents in production picks up the thread.

None selected for this lesson. The CogVLM paper and the CS25 lecture together are the strongest public account of this architecture; secondary discussion does not yet add durable value over them. If a canonical thread surfaces, it will be added at the next review.