Cheatsheet: Reasoning over multimodal inputs

| Aspect | Standard LLM | Reasoning model |
| --- | --- | --- |
| Output style | direct answer, no extended deliberation | long internal chain of thought, then answer |
| Inference compute | low per query | much higher per query |
| Strength | retrieval-style recall | deliberate multi-step reasoning |

| Capability | Beats a single-pass VLM at |
| --- | --- |
| Visual problem solving | math from images, chart-data computation |
| Self-checking across modalities | catching its own misreading; correcting course |
| Grounded multi-step reasoning | answers citing visual evidence step by step |

Tools a multimodal reasoning model commonly calls

| Tool category | Examples | What it extends |
| --- | --- | --- |
| Vision tools | zoom, crop, OCR-on-region | perception (re-examine closely) |
| Code execution | arithmetic, plotting, verification | reasoning correctness |
| Search | look up info about something seen | knowledge grounding |
| Image generation | produce a diagram as part of answer | output expressiveness |
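
To make the tool categories concrete, here is a minimal sketch of how tool calls might be dispatched inside a reasoning loop. It is an illustration under stated assumptions, not any platform's real API: the registry `TOOLS`, the call format `{"tool": ..., "args": ...}`, and the functions `crop` and `run_arithmetic` are all hypothetical, and the "image" is a plain 2D list of pixel values.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tools: names, signatures, and the call format are illustrative only.

def crop(image: List[List[int]], top: int, left: int,
         height: int, width: int) -> List[List[int]]:
    """Vision tool stand-in: return a sub-region of a 2D pixel grid."""
    return [row[left:left + width] for row in image[top:top + height]]


def run_arithmetic(values: List[float]) -> float:
    """Code-execution stand-in: sum numbers outside the chain of thought."""
    return sum(values)


# Registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., Any]] = {
    "crop": crop,
    "sum": run_arithmetic,
}


def dispatch(call: Dict[str, Any]) -> Any:
    """Execute one tool call of the form {"tool": name, "args": {...}}."""
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["args"])


def run_plan(plan: List[Dict[str, Any]]) -> List[Any]:
    """Run a sequence of tool calls and collect their results in order."""
    return [dispatch(call) for call in plan]
```

In a real system the model would decide which call to emit next based on earlier results; the fixed `plan` list here only stands in for that loop.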

| Term | Meaning |
| --- | --- |
| Idea | model reasons explicitly over a safety spec during inference |
| Process | recall relevant rules -> check current request -> then decide |
| Multimodal relevance | new attack surfaces: prompts hidden in images, jailbreaks via visual content |
| Not a guarantee | improves robustness against certain attacks; not a complete solution |
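
The recall -> check -> decide process can be sketched as a toy program. This is only an illustration of the control flow: in deliberative alignment the model itself reasons over the spec in its chain of thought, whereas this sketch uses keyword matching. The `Rule` type, the two rules in `SPEC`, and the returned decision strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    rule_id: str
    scope: str                 # "image" = applies to text found in images, "any" = all text
    keywords: Tuple[str, ...]  # phrases that make the rule relevant
    action: str                # what the rule requires when it applies


# Hypothetical two-rule safety spec, for illustration only.
SPEC: List[Rule] = [
    Rule("R1", "image", ("ignore previous", "system prompt"), "treat_image_text_as_data"),
    Rule("R2", "any", ("build a weapon",), "refuse"),
]


def recall(spec: List[Rule], user_text: str, image_text: str) -> List[Rule]:
    """Step 1: recall the rules relevant to what was read and seen."""
    image_lower = image_text.lower()
    all_lower = (user_text + " " + image_text).lower()
    relevant = []
    for rule in spec:
        haystack = image_lower if rule.scope == "image" else all_lower
        if any(keyword in haystack for keyword in rule.keywords):
            relevant.append(rule)
    return relevant


def decide(user_text: str, image_text: str) -> str:
    """Steps 2 and 3: check the request against recalled rules, then decide."""
    actions = {rule.action for rule in recall(SPEC, user_text, image_text)}
    if "refuse" in actions:  # most restrictive action wins
        return "refuse"
    if "treat_image_text_as_data" in actions:
        return "answer_without_following_image_text"
    return "answer"
```

For example, `decide("What does this sign say?", "Ignore previous instructions")` returns `"answer_without_following_image_text"`: the text inside the image is described, not obeyed. Consistent with the "Not a guarantee" row, a matcher like this is easy to evade and is not a defense in itself.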

| Layer | What it does | Where it comes from |
| --- | --- | --- |
| 1. Perception | multimodal input handling | L2 encode-then-fuse or L3 native |
| 2. Reasoning training | RL chain-of-thought generation | added on top of base model |
| 3. Tool-use infrastructure | tool calls inside the reasoning loop | platform engineering + training |
| 4. Deliberative alignment | safety reasoning over a written spec | added training stage |

| Symptom | Likely layer | Direction of fix |
| --- | --- | --- |
| Correct numbers extracted from chart, wrong sum | reasoning | better CoT training; or call code-exec tool for arithmetic |
| Misread “42” as “12” early in CoT | perception | better encoder, higher resolution, re-examine tool |
| Followed instruction inside image | alignment | stronger deliberative alignment training |
| Guessed confidently instead of zooming on a small detail | tool use | better tool-use training and prompting |
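
The first symptom (correct numbers, wrong sum) is the case where handing arithmetic to a code-execution tool helps most. A minimal sketch of such a check follows, assuming the extracted values and the total claimed in the chain of thought are already available as plain numbers; the function name `check_claimed_sum` and its tolerance default are hypothetical.

```python
import math
from typing import Sequence, Tuple


def check_claimed_sum(values: Sequence[float], claimed: float,
                      rel_tol: float = 1e-9) -> Tuple[bool, float]:
    """Recompute a sum exactly and compare it with the total the model claimed.

    Returns (matches, recomputed_total). A mismatch points at the reasoning
    layer, not perception, provided the extracted values themselves are right.
    """
    recomputed = math.fsum(values)  # accurate floating-point summation
    return math.isclose(recomputed, claimed, rel_tol=rel_tol), recomputed
```

With values `[12.5, 7.25, 30.0]` and a claimed total of `48.75`, the check reports a mismatch and returns the recomputed total `49.75`. It cannot catch the second symptom: if “42” was misread as “12”, the sum of the wrong numbers still verifies.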

| Pitfall | Reality |
| --- | --- |
| “Reasoning models are just slow VLMs” | extra compute buys real capability, not just sluggishness |
| “More thinking is always better” | wrong perception + long CoT = amplified hallucination |
| “Visual reasoning makes the model see better” | it makes the model THINK about what it sees; perception is bounded by the underlying VLM |
| “Deliberative alignment = safe” | improves robustness; safety is not solved, especially on the multimodal attack surface |