
Cheatsheet: From language models to large multimodal models

| Piece | Role | Typical choice |
| --- | --- | --- |
| Vision encoder | image -> sequence of patch embeddings | pretrained ViT, often CLIP-trained |
| Bridge / projector | maps visual vectors -> LLM embedding space | small MLP (LLaVA); deeper module (CogVLM) |
| LLM | processes the mixed sequence, generates text | pretrained transformer (often frozen) |


[ img_tok_1, img_tok_2, ..., img_tok_N, "What is in this image?" ]
              |                                    |
   vision encoder + bridge            text tokens (LLM tokenizer)
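
The mixed sequence above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any model's actual code: the dimensions, the random stand-in inputs, the ReLU activation, and the helper names (`project`, `build_mixed_sequence`) are all hypothetical. It shows only the shape logic: patch embeddings pass through a small MLP bridge into the LLM embedding space, then sit in front of the text embeddings.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(patch, W1, W2):
    """Two-layer MLP bridge: vision dim -> hidden dim -> LLM embedding dim."""
    return matvec(W2, relu(matvec(W1, patch)))


def build_mixed_sequence(patch_embeddings, text_embeddings, W1, W2):
    """Projected image tokens first, then text tokens, all in the LLM space."""
    image_tokens = [project(p, W1, W2) for p in patch_embeddings]
    return image_tokens + text_embeddings


rng = random.Random(0)
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12  # toy sizes (assumption)
N_PATCHES, N_TEXT = 4, 5                     # toy sequence lengths (assumption)

# Random stand-ins for the vision encoder output and the LLM's token embeddings.
patches = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(N_PATCHES)]
text = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(N_TEXT)]

W1 = random_matrix(HIDDEN_DIM, VISION_DIM, rng)
W2 = random_matrix(LLM_DIM, HIDDEN_DIM, rng)

sequence = build_mixed_sequence(patches, text, W1, W2)
print(len(sequence), len(sequence[0]))  # prints "9 12"
```

The LLM never sees pixels, only `N_PATCHES + N_TEXT` vectors of the same width, which is why a pretrained transformer can consume the sequence unchanged.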

CogVLM’s visual expert (the structural refinement)

| Inside one transformer block | Image tokens | Text tokens |
| --- | --- | --- |
| QKV matrices | visual-expert weights | original LLM weights |
| FFN | visual-expert FFN | original LLM FFN |
| Self-attention | shared across both (this is where they interact) | shared across both |
| Effect | new visual capacity | language quality preserved |

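
The routing in the table above can be sketched as follows. This is a heavily simplified, hypothetical block, not CogVLM's implementation: it uses a single attention head, a one-layer ReLU "FFN", no residuals or layer norm, and causal masking is assumed. All names (`block`, `make_weights`, `expert_w`, `text_w`) are illustrative. The point it demonstrates is the last table row: on text-only input the visual-expert weights are never touched, so the output is identical whatever those weights contain.

```python
import math
import random


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_weights(dim, rng):
    """One weight set: Q, K, V projections plus a stand-in FFN matrix."""
    return {
        name: [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        for name in ("q", "k", "v", "ffn")
    }


def block(tokens, is_image, text_w, expert_w):
    """Per-modality QKV and FFN, with one attention shared by all tokens."""
    # Route every token to the weight set for its modality.
    weights = [expert_w if img else text_w for img in is_image]
    q = [matvec(w["q"], t) for w, t in zip(weights, tokens)]
    k = [matvec(w["k"], t) for w, t in zip(weights, tokens)]
    v = [matvec(w["v"], t) for w, t in zip(weights, tokens)]

    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for i, qi in enumerate(q):
        # Shared causal attention: token i attends to every position <= i,
        # image or text alike. This is where the modalities interact.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / scale for j in range(i + 1)]
        probs = softmax(scores)
        attended = [sum(p * v[j][d] for j, p in enumerate(probs)) for d in range(dim)]
        # Modality-specific FFN (a single linear layer + ReLU as a stand-in).
        out.append([max(0.0, x) for x in matvec(weights[i]["ffn"], attended)])
    return out


rng = random.Random(0)
DIM = 4
text_w = make_weights(DIM, rng)
expert_a = make_weights(DIM, rng)
expert_b = make_weights(DIM, rng)  # a differently "trained" visual expert

text_only = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
out_a = block(text_only, [False] * 3, text_w, expert_a)
out_b = block(text_only, [False] * 3, text_w, expert_b)
print(out_a == out_b)  # True: text-only behaviour ignores the visual expert
```
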

| Stage | Trains | LLM weights | Data |
| --- | --- | --- | --- |
| 1. Pretraining alignment | visual expert + bridge | frozen | hundreds of millions of image-text pairs |
| 2. Visual instruction tuning | varies; often visual expert + light LLM | mostly frozen | VQA, multimodal chat, instruction-following |

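
One way to read the two stages is as a choice of which parameter groups receive gradients. The sketch below is a hypothetical illustration only: the group names, the `tune_llm_lightly` flag, and the decision to leave the bridge untouched in stage 2 are assumptions, since the table notes that stage-2 recipes vary. The vision encoder is left out because the table does not say how it is treated.

```python
# Hypothetical parameter groups (names are illustrative, not from any codebase).
PARAM_GROUPS = ("bridge", "visual_expert", "llm_original")


def trainable_groups(stage, tune_llm_lightly=False):
    """Return the set of parameter groups that receive gradients in a stage."""
    if stage == 1:
        # Pretraining alignment: visual expert + bridge, original LLM frozen.
        return {"visual_expert", "bridge"}
    if stage == 2:
        # Visual instruction tuning: one possible variant (assumption).
        groups = {"visual_expert"}
        if tune_llm_lightly:
            # In practice "light" tuning would mean a small learning rate or a
            # subset of layers; here it is reduced to a single on/off flag.
            groups.add("llm_original")
        return groups
    raise ValueError(f"unknown stage: {stage}")


def requires_grad(stage, tune_llm_lightly=False):
    """Map every group to True (trained) or False (frozen) for a stage."""
    trainable = trainable_groups(stage, tune_llm_lightly)
    return {group: group in trainable for group in PARAM_GROUPS}


for stage in (1, 2):
    print(stage, requires_grad(stage))
```

In a real training loop these flags would typically be applied to the framework's parameters before building the optimizer.
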

| Recipe | Language risk | Visual ceiling |
| --- | --- | --- |
| LLM fully frozen, train only projector | none (preserved) | lower |
| LLM unfrozen end-to-end | real (regression likely) | higher |
| Visual expert + frozen original text weights (CogVLM) | none (preserved) | higher |


| Same | Different |
| --- | --- |
| encode-then-fuse architecture | data distribution: GUI screenshots, not natural images |
| visual expert | additional outputs: clickable bounding boxes, UI plans |
| two-stage recipe | early bridge from VLM to multimodal agent |


| Limit | Why |
| --- | --- |
| Fine-grained detail | encoder patches are coarse |
| Precise spatial reasoning | image tokens are summaries, not pixel-level |
| Deep cross-modal grounding | the two halves were trained separately; alignment is bolted on -> next lesson |
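
To make "encoder patches are coarse" concrete, here is a small worked calculation. The image sizes and the 14-pixel patch size are illustrative assumptions, not the settings of any particular model, and `patch_grid` is a hypothetical helper.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total image tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


PATCH = 14  # assumed ViT patch size in pixels
for image_size in (224, 448):
    per_side, n_tokens = patch_grid(image_size, PATCH)
    pixels_per_token = PATCH * PATCH
    print(f"{image_size}px image: {per_side}x{per_side} grid, "
          f"{n_tokens} tokens, {pixels_per_token} pixels per token")
```

At 224 pixels this gives a 16x16 grid of 256 tokens, each summarizing 196 pixels; doubling the resolution to 448 quadruples the token count to 1024. Anything smaller than one patch, such as a small icon or a line of fine print, is compressed into a single vector, which is the root of the first two limits in the table.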