Large multimodal models, in brief

What you’ll learn

This is lesson 2 of Track 24, the opener of Phase 2 (Building large multimodal models). By the end you will be able to walk through the architecture of an encode-then-fuse vision-language model end to end and explain why CogVLM’s visual-expert design lets the model grow visual capability without losing its language ability. The one capability to walk away with: given a vision-language model, identify its three architectural pieces (encoder, bridge, LLM) and predict where in the design language quality is at risk during training.

The lesson maps directly to Ming Ding’s CS25 V4 guest lecture (May 9, 2024); full attribution is in this lesson’s references.

Where this fits

L1 gave you the map of multimodal AI; this lesson places the most widely deployed architecture on it. Almost every vision-language model built by extending an existing LLM (GPT-4V, Claude with vision, LLaVA, CogVLM) follows the encode-then-fuse pattern, so understanding it carries over to most of the systems you actually use. The next lesson asks what changes when you abandon the bolt-on approach and train one model on mixed modalities from the start. That is the natively multimodal direction, where models like Gemini belong.

Before you start

Prerequisite: Lesson 1, What multimodal AI actually is. You need the encode-then-fuse framing introduced there (vision encoder plus bridge plus LLM, as one of the two dominant strategies), because this lesson unpacks that strategy concretely. Familiarity with transformers (attention, QKV, feed-forward layers) from prior tracks (T11, T13, T20) will help, since CogVLM’s structural choice is a modification of the standard transformer block.
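
As a refresher on that framing, here is a minimal shape-level sketch of the three pieces. Everything in it is an illustrative assumption rather than CogVLM's actual code: the toy sizes, the helper names (linear, rand_matrix, encode_then_fuse), the single linear map standing in for a real vision encoder, the linear-projection bridge (one common choice among several), and the decision to place image tokens ahead of the text tokens.

```python
import random

random.seed(0)

def linear(rows, weight):
    """Multiply a list of row vectors by a weight matrix of shape (in_dim, out_dim)."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in rows]

def rand_matrix(n_rows, n_cols):
    """Small random matrix; stands in for learned weights or input data."""
    return [[random.gauss(0.0, 0.02) for _ in range(n_cols)] for _ in range(n_rows)]

# Hypothetical toy sizes, chosen only for illustration.
NUM_PATCHES, PATCH_DIM = 4, 12   # image split into 4 flattened patches
VISION_DIM, LLM_DIM = 8, 16      # encoder width and LLM embedding width
NUM_TEXT_TOKENS = 3

# Piece 1: vision encoder (a single linear map here, not a real ViT).
encoder_weight = rand_matrix(PATCH_DIM, VISION_DIM)
# Piece 2: bridge that projects vision features into the LLM embedding space.
bridge_weight = rand_matrix(VISION_DIM, LLM_DIM)

def encode_then_fuse(patches, text_embeddings):
    vision_features = linear(patches, encoder_weight)       # NUM_PATCHES x VISION_DIM
    image_tokens = linear(vision_features, bridge_weight)   # NUM_PATCHES x LLM_DIM
    # Piece 3: the LLM would consume this fused sequence of image and text tokens.
    return image_tokens + text_embeddings

patches = rand_matrix(NUM_PATCHES, PATCH_DIM)
text_embeddings = rand_matrix(NUM_TEXT_TOKENS, LLM_DIM)
fused = encode_then_fuse(patches, text_embeddings)
print(len(fused), len(fused[0]))  # 7 16
```

The point to notice is that the bridge's only job is to make image features the same width as the LLM's token embeddings, so the LLM can treat them as extra positions in its input sequence.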

By the end, you’ll be able to

Describe the three pieces of an encode-then-fuse LMM
Explain CogVLM’s visual expert and how it preserves language quality
Walk through the two-stage training recipe
Explain how CogAgent extends the architecture to GUI tasks
Identify the ceiling of encode-then-fuse that motivates natively multimodal models

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (an architecture-component matching exercise, a freezing-and-fine-tuning judgment question, and flashcards)
Difficulty: standard