From language models to large multimodal models

This is lesson 2 of Track 24, the opener of Phase 2 (Building large multimodal models). By the end you will be able to walk through the architecture of an encode-then-fuse vision-language model end to end and explain why CogVLM’s visual-expert design lets the model grow visual capability without losing its language ability. The one capability to walk away with: given a vision-language model, identify its three architectural pieces (encoder, bridge, LLM) and predict where in the design language quality is at risk during training.

The lesson maps directly to Ming Ding’s CS25 V4 guest lecture (May 9, 2024); full attribution is in this lesson’s references.

Lesson 1 gave you the map of multimodal AI; this lesson plants the most widely deployed architecture on it. Almost every vision-language model built by extending an existing LLM (GPT-4V, Claude with vision, LLaVA, CogVLM) is a descendant of the encode-then-fuse pattern, so understanding it carries through to most of the systems you actually use. The next lesson asks what changes when you abandon the bolt-on approach and train one model on mixed modalities from the start. That is the natively multimodal direction, where models like Gemini belong.

Prerequisite: Lesson 1, What multimodal AI actually is. You need the encode-then-fuse framing introduced there (vision encoder plus bridge plus LLM, as one of the two dominant strategies), because this lesson unpacks that strategy concretely. Familiarity with transformers (attention, QKV, feed-forward layers) from prior tracks (T11, T13, T20) will help, since CogVLM’s structural choice is a modification of the standard transformer block.
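To make the encoder-plus-bridge-plus-LLM framing concrete before the lesson unpacks it, here is a minimal shape-level sketch in plain Python. Everything in it is an illustrative assumption rather than CogVLM's actual code: the dimensions, the random weights, the helper names (`rand_matrix`, `matmul`, `encode_then_fuse`), and the choice of a single linear projection as the bridge, which is only one simple option. It stops at the fused token sequence, which is what the LLM would then process.

```python
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

PATCH_DIM = 8   # hypothetical: flattened pixels per image patch
D_VISION = 4    # hypothetical: width of the vision encoder's features
D_MODEL = 6     # hypothetical: width of the LLM's token embeddings
VOCAB = 10      # hypothetical: toy vocabulary size


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain-Python matrix product: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


W_ENCODER = rand_matrix(PATCH_DIM, D_VISION)  # piece 1: vision encoder (stand-in for a ViT)
W_BRIDGE = rand_matrix(D_VISION, D_MODEL)     # piece 2: bridge into the LLM's embedding space
TEXT_EMBED = rand_matrix(VOCAB, D_MODEL)      # piece 3 (input side): the LLM's embedding table


def encode_then_fuse(patches, token_ids):
    """Return the sequence an LLM would consume: image tokens, then text tokens."""
    visual_features = matmul(patches, W_ENCODER)      # (n_patches, D_VISION)
    image_tokens = matmul(visual_features, W_BRIDGE)  # (n_patches, D_MODEL)
    text_tokens = [TEXT_EMBED[i] for i in token_ids]  # (n_text, D_MODEL)
    return image_tokens + text_tokens                 # concatenate along the sequence axis


patches = rand_matrix(3, PATCH_DIM)  # 3 toy image patches
fused = encode_then_fuse(patches, [1, 4, 7, 2])
print(len(fused), len(fused[0]))
```

With 3 patches and 4 text tokens this prints `7 6`: seven tokens, each of width `D_MODEL`. Note that the bridge is the only place where visual features are reshaped to look like LLM tokens, which is why decisions about which of the three pieces to freeze or fine-tune matter for language quality.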

  • Describe the three pieces of an encode-then-fuse LMM
  • Explain CogVLM’s visual expert and how it preserves language quality
  • Walk through the two-stage training recipe
  • Explain how CogAgent extends the architecture to GUI tasks
  • Identify the ceiling of encode-then-fuse that motivates natively multimodal models

  • Read time: about 13 minutes
  • Practice time: about 15 minutes (an architecture-component matching exercise, a freezing-and-fine-tuning judgment question, and flashcards)
  • Difficulty: standard