Cheatsheet: From language models to large multimodal models
The encode-then-fuse recipe
| Piece | Role | Typical choice |
|---|---|---|
| Vision encoder | image -> sequence of patch embeddings | pretrained ViT, often CLIP-trained |
| Bridge / projector | maps visual vectors -> LLM embedding space | small MLP (LLaVA); deeper module (CogVLM) |
| LLM | processes the mixed sequence, generates text | pretrained transformer (often frozen) |
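
The three pieces above can be sketched end to end with toy sizes. This is a minimal illustration, not any model's real implementation: the patch size, the widths `D_VISION` and `D_LLM`, the random weights, the whitespace tokenizer, and every function name are assumptions made for the example. It only shows how patch vectors pass through the bridge and land in the same sequence as the text embeddings.

```python
import random

PATCH = 14     # assumed patch size in pixels
D_VISION = 8   # toy vision-encoder width (assumption)
D_LLM = 16     # toy LLM embedding width (assumption)


def rand_matrix(rows, cols, rng):
    """Random rows x cols matrix standing in for pretrained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weight):
    """Multiply a length-d_in vector by a d_in x d_out matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]


def vision_encoder(height, width, rng):
    """Stand-in for a pretrained ViT: one D_VISION vector per patch."""
    n_patches = (height // PATCH) * (width // PATCH)
    return rand_matrix(n_patches, D_VISION, rng)


def projector(patch_vecs, w1, w2):
    """Small two-layer MLP bridge: vision space -> LLM embedding space."""
    projected = []
    for vec in patch_vecs:
        hidden = [max(0.0, h) for h in linear(vec, w1)]  # ReLU
        projected.append(linear(hidden, w2))
    return projected


def embed_text(prompt, rng):
    """Toy tokenizer (whitespace split) plus random embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(D_LLM)] for _ in prompt.split()]


def build_mixed_sequence(height, width, prompt, seed=0):
    """Image tokens first, then text tokens, all of width D_LLM."""
    rng = random.Random(seed)
    w1 = rand_matrix(D_VISION, D_LLM, rng)
    w2 = rand_matrix(D_LLM, D_LLM, rng)
    image_tokens = projector(vision_encoder(height, width, rng), w1, w2)
    text_tokens = embed_text(prompt, rng)
    return image_tokens + text_tokens
```

Calling `build_mixed_sequence(28, 42, "What is in this image?")` yields 6 image tokens (a 2 x 3 patch grid) followed by 5 text tokens, each a vector of length `D_LLM`; the LLM would consume that list as one sequence.
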
The mixed sequence

    [ img_tok_1, img_tok_2, ..., img_tok_N, "What is in this image?" ]
      |                                     |
      vision encoder + bridge               text tokens (LLM tokenizer)

CogVLM’s visual expert (the structural refinement)
| Inside one transformer block | Image tokens | Text tokens |
|---|---|---|
| QKV matrices | visual-expert weights | original LLM weights |
| FFN | visual-expert FFN | original LLM FFN |
| Self-attention | SHARED across both (this is where they interact) | |
| Effect | new visual capacity | language quality preserved |
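
The routing in the table can be sketched as one simplified block. This is a toy under stated assumptions, not CogVLM's code: a single attention head, no layer norm, no output projection, a one-matrix ReLU FFN, and hypothetical names (`make_expert`, `expert_block`). What it does keep is the structure from the table: weights are chosen per token by modality, while the attention step runs over the whole mixed sequence.

```python
import math
import random

D = 4  # toy hidden size (assumption)


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weight):
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(rng):
    """One set of Q, K, V and FFN weights (text weights or visual expert)."""
    return {name: rand_matrix(D, D, rng) for name in ("q", "k", "v", "ffn")}


def expert_block(hidden, is_image, text_w, visual_w):
    # 1. Modality-specific Q/K/V: image tokens use the visual expert,
    #    text tokens use the original LLM weights.
    q, k, v = [], [], []
    for h, img in zip(hidden, is_image):
        w = visual_w if img else text_w
        q.append(linear(h, w["q"]))
        k.append(linear(h, w["k"]))
        v.append(linear(h, w["v"]))

    # 2. Shared causal attention over the full mixed sequence:
    #    this is the only place image and text tokens interact.
    attended = []
    for i in range(len(hidden)):
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(D)
                  for j in range(i + 1)]
        weights = softmax(scores)
        attended.append([sum(wt * v[j][d] for j, wt in enumerate(weights))
                         for d in range(D)])

    # 3. Modality-specific FFN, with residual connections.
    out = []
    for h, a, img in zip(hidden, attended, is_image):
        w = visual_w if img else text_w
        x = [hi + ai for hi, ai in zip(h, a)]
        ffn = [max(0.0, f) for f in linear(x, w["ffn"])]
        out.append([xi + fi for xi, fi in zip(x, ffn)])
    return out
```

With a text-only input (`is_image` all `False`), the visual-expert weights are never touched, so the output is identical whatever those weights contain. That is the sense in which the table's "language quality preserved" holds in this sketch.
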
Two-stage training
| Stage | Trains | LLM weights | Data |
|---|---|---|---|
| 1. Pretraining alignment | visual expert + bridge | frozen | hundreds of millions of image-text pairs |
| 2. Visual instruction tuning | varies; often visual expert + light LLM | mostly frozen | VQA, multimodal chat, instruction-following |
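
One way to picture the two stages is as a per-stage map of which parameters the optimizer may update. The sketch below is illustrative only: the parameter-name prefixes are hypothetical, the vision encoder is assumed to stay frozen in both stages (the table does not say), and "light LLM" tuning is reduced to a single on/off flag, whereas in practice it might mean a small learning rate or adapter layers.

```python
# Hypothetical parameter-name prefixes; real checkpoints name modules differently.
STAGE_PREFIXES = {
    1: ("bridge.", "visual_expert."),  # stage 1: pretraining alignment
    2: ("bridge.", "visual_expert."),  # stage 2: visual instruction tuning (base set)
}
LLM_PREFIX = "llm."

PARAMS = [
    "vision_encoder.patch_embed.weight",
    "bridge.mlp.weight",
    "visual_expert.layer0.qkv.weight",
    "visual_expert.layer0.ffn.weight",
    "llm.layer0.qkv.weight",
    "llm.layer0.ffn.weight",
]


def trainable_flags(param_names, stage, lightly_tune_llm=False):
    """Return {name: bool}; True means the parameter is updated in this stage."""
    if stage not in STAGE_PREFIXES:
        raise ValueError(f"unknown stage: {stage}")
    prefixes = STAGE_PREFIXES[stage]
    # Only stage 2 may optionally unfreeze the LLM's original weights.
    if stage == 2 and lightly_tune_llm:
        prefixes = prefixes + (LLM_PREFIX,)
    return {name: name.startswith(prefixes) for name in param_names}


if __name__ == "__main__":
    for stage in (1, 2):
        flags = trainable_flags(PARAMS, stage, lightly_tune_llm=(stage == 2))
        print(stage, sorted(name for name, on in flags.items() if on))
```

In a real training loop these flags would typically be applied by toggling each parameter's gradient tracking before the optimizer is built.
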
Freezing vs fine-tuning (the tradeoff)
| Recipe | Language risk | Visual ceiling |
|---|---|---|
| LLM fully frozen, train only projector | none (preserved) | lower |
| LLM unfrozen end-to-end | real (regression likely) | higher |
| Visual expert + frozen original text weights (CogVLM) | none (preserved) | higher |
CogVLM -> CogAgent
| Same | Different |
|---|---|
| encode-then-fuse architecture | data distribution: GUI screenshots, not natural images |
| visual expert | additional outputs: clickable bounding boxes, UI plans |
| two-stage recipe | early bridge from VLM to multimodal agent |
Where this hits limits
| Limit | Why |
|---|---|
| Fine-grained detail | encoder patches are coarse |
| Precise spatial reasoning | image tokens are summaries, not pixel-level |
| Deep cross-modal grounding | the two halves were trained separately; alignment is bolted on -> next lesson |
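
The first two limits come down to simple arithmetic on the patch grid. The sketch below assumes a plain ViT-style encoder with square, non-overlapping patches and no token pooling or merging; the patch size of 14 and the image sizes are example values, not figures from any specific model.

```python
def patch_grid(height, width, patch=14):
    """Return (number of image tokens, pixels summarized by each token)."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    n_tokens = (height // patch) * (width // patch)
    return n_tokens, patch * patch


if __name__ == "__main__":
    for side in (224, 448):
        tokens, pixels = patch_grid(side, side)
        print(f"{side}x{side}: {tokens} tokens, {pixels} pixels per token")
```

Under these assumptions each token summarizes 196 pixels, and doubling the image side from 224 to 448 quadruples the token count from 256 to 1024. Finer detail therefore costs sequence length, which is why small text or thin UI elements are easy to lose at low resolution.
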