Summary: From language models to large multimodal models

An encode-then-fuse LMM extends a pretrained LLM with a vision encoder and a bridge module, then trains the system in two stages without destroying the LLM’s language ability. CogVLM is the case study: its ‘visual expert’ gives image tokens their own QKV and FFN inside each transformer block, alongside the frozen text-handling weights, so vision capacity grows without language regression. This summary is the scan version of the full lesson.

Core ideas

The starting point. A pretrained LLM that reads text and is blind to images; the goal is to add vision without degrading language quality (the central design constraint).
The two halves. A vision encoder (almost always a pretrained ViT, often CLIP-trained) turns the image into patch embeddings; a bridge (in the simplest case, a small MLP) projects those vectors into the LLM’s embedding space.
The mixed sequence. After projection, the model sees the projected image tokens (image-token-1 through image-token-N) followed by the text tokens of the question, and processes the whole thing like any other sequence.
First-wave architecture (LLaVA-style). ViT + projector + LLM. Train only the projector during alignment while the LLM stays frozen, then run a small instruction fine-tune (in LLaVA this second stage also updates the LLM). Cheap and effective, but with a real ceiling.
CogVLM’s visual expert. A parallel copy of QKV and FFN inside each block, dedicated to image tokens. Text tokens go through the frozen original LLM weights; image tokens go through the visual expert; they interact through shared self-attention. Language quality is preserved by construction; visual capacity grows.
Two-stage training. Pretraining alignment on image-text pairs (LLM frozen), then visual instruction tuning on multimodal dialogue.
CogAgent. Same encode-then-fuse architecture, retrained for GUI understanding (clickable bounding boxes, step-by-step UI plans). Early bridge to multimodal agents.
The ceiling. The vision encoder and the LLM were trained separately; they know each other only through alignment, not through joint training from the start. Native multimodal models (next lesson) target this limit.
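
The mixed sequence and the visual-expert routing described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not CogVLM's actual code: every size, weight, and function name here (project, routed_linear, visual_expert_attention) is hypothetical, the bridge is reduced to a single linear map, and attention is single-head with a causal mask as in a decoder-only LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only, not any real model's configuration.
N_IMG, N_TXT = 4, 3      # number of image tokens and text tokens
D_VIT, D_LLM = 8, 16     # vision-encoder width and LLM embedding width


def project(patches, w_proj):
    """Bridge: map patch embeddings into the LLM embedding space (linear stand-in for the MLP)."""
    return patches @ w_proj


def routed_linear(x, is_image, w_text, w_image):
    """Use w_image for image-token rows and w_text for text-token rows."""
    return np.where(is_image[:, None], x @ w_image, x @ w_text)


def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


def visual_expert_attention(x, is_image, text_w, image_w):
    """Single-head causal self-attention with per-modality QKV weights.

    Q, K, V are computed with different weights per modality, but all tokens
    then share one attention matrix, which is where the modalities interact.
    """
    q = routed_linear(x, is_image, text_w["q"], image_w["q"])
    k = routed_linear(x, is_image, text_w["k"], image_w["k"])
    v = routed_linear(x, is_image, text_w["v"], image_w["v"])
    scores = q @ k.T / np.sqrt(x.shape[-1])
    causal = np.tril(np.ones((len(x), len(x)), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    return softmax(scores) @ v


# Stand-ins for the ViT output and the embedded question tokens.
patches = rng.normal(size=(N_IMG, D_VIT))
text_emb = rng.normal(size=(N_TXT, D_LLM))
w_proj = rng.normal(size=(D_VIT, D_LLM))

# Mixed sequence: projected image tokens first, then text tokens.
image_tokens = project(patches, w_proj)
sequence = np.concatenate([image_tokens, text_emb], axis=0)
is_image = np.arange(N_IMG + N_TXT) < N_IMG

# text_w plays the role of the frozen LLM weights; image_w is the visual expert.
text_w = {name: rng.normal(size=(D_LLM, D_LLM)) for name in "qkv"}
image_w = {name: rng.normal(size=(D_LLM, D_LLM)) for name in "qkv"}

out = visual_expert_attention(sequence, is_image, text_w, image_w)
print(sequence.shape, out.shape)  # (7, 16) (7, 16)
```

The visual expert's FFN would follow the same routing pattern as routed_linear and is omitted for brevity. In a trainable version of this sketch, only w_proj and image_w would receive gradient updates while text_w stays fixed, which is how the design keeps text behavior unchanged.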

What changes for you

When you use LLaVA, and most likely GPT-4V or Claude with vision (their architectures are not public), you are talking to an encode-then-fuse descendant of this recipe. Gemini, by contrast, is natively multimodal, the direction the next lesson takes up. Knowing the recipe also lets you predict where these models struggle: very high-resolution detail, precise spatial reasoning, and fine-grained cross-modal grounding tend to be the soft spots, because the visual encoder’s patches are coarse and the bridge between vision and language is bolted on after the fact. The recurring tradeoff worth carrying forward: the more you let the LLM weights move during multimodal training, the more language quality you put at risk. Designs that preserve language well (frozen LLM, CogVLM’s visual expert) trade a little visual ceiling for stable text behavior. The next lesson takes the opposite bet: train one model on mixed modalities from the start.
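
To put a number on the "coarse patches" point above, here is a small arithmetic sketch. The 224-pixel input and 14-pixel patch size are illustrative values (a common ViT configuration), the 1920x1080 screenshot is a made-up example, and the helper names are hypothetical. It also assumes the image is simply resized to a square with no cropping or tiling.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patch tokens) for a square ViT input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


def source_pixels_per_patch(src_w, src_h, image_size, patch_size):
    """Source-image pixels covered by one patch after resizing to the ViT input."""
    side, _ = patch_grid(image_size, patch_size)
    return src_w / side, src_h / side


side, n_tokens = patch_grid(224, 14)
print(side, n_tokens)  # 16 256

# Each image token summarizes a large region of the original screenshot.
print(source_pixels_per_patch(1920, 1080, 224, 14))  # (120.0, 67.5)
```

Under these assumptions a single image token stands in for roughly 120 by 67 source pixels, which is larger than many UI labels or small printed words. That is one way to see why fine detail is a soft spot for this family of models.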