Cheatsheet: Reasoning over multimodal inputs
Reasoning models in one line
| Aspect | Standard LLM | Reasoning model |
|---|---|---|
| Output style | direct answer, no extended reasoning phase | long internal chain of thought, then answer |
| Inference compute | low per query | much higher per query |
| Strength | retrieval-style recall | deliberate multi-step reasoning |
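
The compute gap in the table can be made concrete with a rough count. Autoregressive decoding runs about one forward pass per generated token, so a long chain of thought multiplies decode work. The sketch below ignores prompt (prefill) cost, and the token counts are hypothetical, chosen only for illustration.

```python
def decode_steps(reasoning_tokens: int, answer_tokens: int) -> int:
    """Approximate decode work: one forward pass per generated token."""
    return reasoning_tokens + answer_tokens


# Hypothetical token counts, not measurements from any real model.
standard = decode_steps(reasoning_tokens=0, answer_tokens=150)
reasoning = decode_steps(reasoning_tokens=3000, answer_tokens=150)

# How many times more decode steps the reasoning run needs.
ratio = reasoning / standard
print(standard, reasoning, ratio)
```

The ratio depends entirely on how long the model chooses to think, which is why per-query cost for a reasoning model is harder to predict in advance.
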
What multimodal reasoning enables
| Capability | Beats a single-pass VLM at |
|---|---|
| Visual problem solving | math from images, chart-data computation |
| Self-checking across modalities | catching its own misreading; correcting course |
| Grounded multi-step reasoning | answers citing visual evidence step by step |
Tools a multimodal reasoning model commonly calls
| Tool category | Examples | What it extends |
|---|---|---|
| Vision tools | zoom, crop, OCR-on-region | perception (re-examine closely) |
| Code execution | arithmetic, plotting, verification | reasoning correctness |
| Search | look up info about something seen | knowledge grounding |
| Image generation | produce a diagram as part of answer | output expressiveness |
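
A minimal sketch of how two of the tool categories above might be exposed to a reasoning loop. Everything here is an assumption for illustration: the toy nested-list image, the `crop` and `add_numbers` tools, the `TOOLS` registry, and `call_tool` are hypothetical names, not any platform's actual API.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Vision tool: return a sub-region so it can be re-examined closely."""
    return [row[left:left + width] for row in image[top:top + height]]


def add_numbers(values: List[float]) -> float:
    """Code-execution stand-in: compute the sum exactly instead of estimating it."""
    return sum(values)


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable] = {"crop": crop, "add_numbers": add_numbers}


def call_tool(name: str, **kwargs):
    """Dispatch a tool call requested from inside the reasoning loop."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
region = call_tool("crop", image=image, top=1, left=1, height=2, width=2)
total = call_tool("add_numbers", values=[1.5, 2.5])
```

In a real system the model would emit the tool name and arguments as part of its chain of thought, and the result would be fed back in before reasoning continues.
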
Deliberative alignment
| Term | Meaning |
|---|---|
| Idea | model reasons explicitly over a safety spec during inference |
| Process | recall relevant rules -> check current request -> then decide |
| Multimodal relevance | new attack surfaces: prompts hidden in images, jailbreaks via visual content |
| Not a guarantee | improves robustness against certain attacks; not a complete solution |
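
The recall -> check -> decide process can be sketched as control flow. This is a toy under loud assumptions: in a real model the steps happen as reasoning over a written spec, not keyword matching, and `Rule`, `SPEC`, `recall_rules`, and `decide` are hypothetical names invented for this example. Text extracted from an image is passed in separately to mirror the hidden-prompt attack surface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    name: str
    trigger: str  # keyword that makes the rule relevant (toy matching only)
    action: str   # "refuse" or "allow"


# Hypothetical safety spec, invented for this sketch.
SPEC = [
    Rule("no-credential-exfiltration", "password", "refuse"),
    Rule("no-embedded-overrides", "ignore previous instructions", "refuse"),
]


def recall_rules(text: str, spec: List[Rule]) -> List[Rule]:
    """Step 1: recall the rules relevant to this input."""
    lowered = text.lower()
    return [rule for rule in spec if rule.trigger in lowered]


def decide(user_prompt: str, image_text: str = "") -> str:
    """Steps 2 and 3: check the request against recalled rules, then decide."""
    # Text found inside an image is checked too, since prompts can hide there.
    combined = f"{user_prompt}\n{image_text}"
    relevant = recall_rules(combined, SPEC)
    if any(rule.action == "refuse" for rule in relevant):
        return "refuse"
    return "comply"
```

A keyword filter like this is easy to evade, which matches the "not a guarantee" row: the structure of the check is the point, not its strength.
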
The four-layer stack
| Layer | What it does | Where it comes from |
|---|---|---|
| 1. Perception | multimodal input handling | L2 encode-then-fuse or L3 native |
| 2. Reasoning training | RL-trained chain-of-thought generation | added on top of base model |
| 3. Tool-use infrastructure | tool calls inside the reasoning loop | platform engineering + training |
| 4. Deliberative alignment | safety reasoning over a written spec | added training stage |
Failure-layer diagnostic
| Symptom | Likely layer | Direction of fix |
|---|---|---|
| Correct numbers extracted from chart, wrong sum | reasoning | better CoT training; or call code-exec tool for arithmetic |
| Misread “42” as “12” early in CoT | perception | better encoder, higher resolution, re-examine tool |
| Followed instruction inside image | alignment | stronger deliberative alignment training |
| Guessed confidently instead of zooming on a small detail | tool use | better tool-use training and prompting |
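
The diagnostic table can be read as a small decision procedure over facts observed in a failed trace. A minimal sketch follows; `TraceFacts`, its field names, and the order in which the checks run are assumptions made for this example, not part of the table.

```python
from dataclasses import dataclass


@dataclass
class TraceFacts:
    followed_image_instruction: bool  # obeyed text embedded in the image
    needed_closer_look: bool          # a small detail mattered
    used_vision_tool: bool            # zoom/crop was actually called
    extracted_values_correct: bool    # numbers read from the image are right
    final_computation_correct: bool   # arithmetic on those numbers is right


def likely_failure_layer(t: TraceFacts) -> str:
    """Map observed facts to the most likely failing layer (assumed ordering)."""
    if t.followed_image_instruction:
        return "alignment"
    if t.needed_closer_look and not t.used_vision_tool:
        return "tool use"
    if not t.extracted_values_correct:
        return "perception"
    if not t.final_computation_correct:
        return "reasoning"
    return "none detected"


# First table row: chart values read correctly, sum computed wrongly.
wrong_sum = TraceFacts(False, False, False, True, False)
# Second table row: "42" misread as "12" early in the chain of thought.
misread = TraceFacts(False, False, False, False, False)
```

Checking perception before reasoning matters: a wrong final number caused by a misread input should not be blamed on the arithmetic.
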
Pitfalls
| Pitfall | Reality |
|---|---|
| “Reasoning models are just slow VLMs” | extra compute buys real capability, not just sluggishness |
| “More thinking is always better” | wrong perception + long CoT = amplified hallucination |
| “Visual reasoning makes the model see better” | it makes the model THINK about what it sees; perception is bounded by the underlying VLM |
| “Deliberative alignment = safe” | improves robustness; not solved, especially on multimodal attack surface |
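
A small worked example of the "wrong perception + long CoT" pitfall. The values are hypothetical apart from the "42" read as "12" slip used in the diagnostic table: every later step is computed exactly, yet the answer is still wrong because the error entered at perception.

```python
from typing import List


def total(values: List[int]) -> int:
    """Exact arithmetic, as a code-execution tool would provide."""
    return sum(values)


# Hypothetical chart values; only the 42 -> 12 misreading comes from the cheatsheet.
true_values = [42, 17, 8]
perceived_values = [12, 17, 8]  # "42" misread as "12"

true_total = total(true_values)
perceived_total = total(perceived_values)

# The reasoning step is flawless, yet the misreading carries straight through.
error = true_total - perceived_total
print(true_total, perceived_total, error)
```

More reasoning steps built on `perceived_total` would inherit the same error, which is why re-examining the image is the fix here rather than thinking longer.
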