From language models to large multimodal models
What you’ll learn
This is lesson 2 of Track 24, the opener of Phase 2 (Building large multimodal models). By the end you will be able to walk through the architecture of an encode-then-fuse vision-language model end to end and explain why CogVLM’s visual-expert design lets the model grow visual capability without losing its language ability. The one capability to walk away with: given a vision-language model, identify its three architectural pieces (encoder, bridge, LLM) and predict where in the design language quality is at risk during training.
The lesson maps directly to Ming Ding’s CS25 V4 guest lecture (May 9, 2024); full attribution is in this lesson’s references.
Where this fits
L1 gave you the map of multimodal AI; this lesson plants the most widely deployed architecture on it. Almost every vision-language model built by extending an existing LLM (GPT-4V, Claude with vision, LLaVA, CogVLM) is an encode-then-fuse descendant of this pattern, so understanding it carries through to most of the systems you actually use. The next lesson asks what changes when you abandon the bolt-on approach and train one model on mixed modalities from the start. That is the natively multimodal direction, where models like Gemini belong.
Before you start
Prerequisite: Lesson 1, What multimodal AI actually is. You need the encode-then-fuse framing introduced there (vision encoder plus bridge plus LLM, as one of the two dominant strategies), because this lesson unpacks that strategy concretely. Familiarity with transformers (attention, QKV, feed-forward layers) from prior tracks (T11, T13, T20) will help, since CogVLM’s structural choice is a modification of the standard transformer block.
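As a refresher on the prerequisite framing, the three pieces can be sketched as a toy data flow. This is a minimal illustration under stated assumptions, not the design of any particular model: the names `vision_encoder`, `bridge`, `embed_text`, and `build_llm_input` are hypothetical, the dimensions are arbitrary, the weights are random, and the bridge is assumed to be a single linear projection (real systems may use other bridge designs). The only point is that image features must end up with the same width as the LLM's token embeddings before the two can share one sequence.

```python
import random

VISION_DIM = 6   # hypothetical width of the vision encoder's output
LLM_DIM = 16     # hypothetical width of the LLM's token embeddings


def make_matrix(rows, cols, seed):
    """Build a rows x cols matrix of random weights (illustrative only)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vectors, weight):
    """Multiply each row vector (length in_dim) by weight (in_dim x out_dim)."""
    in_dim, out_dim = len(weight), len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(in_dim)) for j in range(out_dim)]
        for v in vectors
    ]


def vision_encoder(patches):
    """Piece 1: map each image patch to a feature vector (stand-in for a real encoder)."""
    return project(patches, make_matrix(len(patches[0]), VISION_DIM, seed=1))


def bridge(image_features):
    """Piece 2: project vision features into the LLM's embedding width (assumed linear)."""
    return project(image_features, make_matrix(VISION_DIM, LLM_DIM, seed=2))


def embed_text(token_ids):
    """Stand-in for the LLM's own token embedding table (toy vocabulary of 100)."""
    table = make_matrix(100, LLM_DIM, seed=3)
    return [table[t] for t in token_ids]


def build_llm_input(patches, token_ids):
    """Piece 3 input: image tokens followed by text tokens, all LLM_DIM wide."""
    return bridge(vision_encoder(patches)) + embed_text(token_ids)


patches = [[0.1, 0.2, 0.3, 0.4] for _ in range(5)]  # 5 toy patches of 4 values each
sequence = build_llm_input(patches, token_ids=[7, 42, 3])
print(len(sequence), len(sequence[0]))  # 8 16 (5 image tokens + 3 text tokens)
```

In this sketch the LLM itself is left out: whatever consumes `sequence` sees image tokens and text tokens as one list of equally wide vectors, which is where the question of protecting language quality during training arises.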
By the end, you’ll be able to
- Describe the three pieces of an encode-then-fuse LMM
- Explain CogVLM’s visual expert and how it preserves language quality
- Walk the two-stage training recipe
- Explain how CogAgent extends the architecture to GUI tasks
- Identify the ceiling of encode-then-fuse that motivates natively multimodal models
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 15 minutes (an architecture-component matching exercise, a freezing-and-fine-tuning judgment question, and flashcards)
- Difficulty: standard