Cheatsheet: Vision and language

| Element | Detail |
|---|---|
| Image encoder | ViT or ResNet → embedding vector |
| Text encoder | Transformer → embedding vector (same dim) |
| Projection | Both projected to shared embedding space, L2-normalized to unit vectors |
| Training data | ~400 million web (image, text) pairs |
| Loss | Symmetric InfoNCE (contrastive): matching pairs close in cosine, mismatched far |
| Output | Geometry of joint embedding space encodes cross-modal semantic structure |
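
The loss row can be made concrete with a small sketch. This is a minimal pure-Python version of a symmetric InfoNCE loss, assuming that `image_embs[i]` and `text_embs[i]` form a matching pair and that every other combination in the batch is a negative. The function names and the fixed `temperature=0.07` are assumptions for illustration; real implementations run on batched tensors and usually learn the temperature.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image->text and text->image cross-entropy over a batch."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image -> text: each row should put its probability mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: the same objective taken over columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```

The loss is near zero when each image is closest to its own text and large when the pairing is scrambled, which is the "matching close, mismatched far" behaviour described in the table.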

Both vectors are unit length, so cosine similarity equals the dot product:

| Pair | Vectors | Cosine |
|---|---|---|
| Image vs matching text | [0.6, 0.8] vs [0.5, 0.866] | ≈ 0.993 (high) |
| Image vs unrelated text | [0.6, 0.8] vs [0.8, -0.6] | 0.000 (orthogonal) |

Because both encoders map into the same embedding space, image and text vectors can be compared directly by cosine similarity.
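
The two cosine values in the table can be checked with a few lines of Python. This sketch uses only the standard library; the `cosine` helper is written for this example and divides by the norms explicitly, so it also works when a vector is only approximately unit length (as [0.5, 0.866] is).

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


image = [0.6, 0.8]
matching_text = [0.5, 0.866]
unrelated_text = [0.8, -0.6]

print(round(cosine(image, matching_text), 3))   # 0.993
print(round(cosine(image, unrelated_text), 3))  # 0.0
```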

| Step | Action |
|---|---|
| 1 | Pick K candidate classes |
| 2 | Build text prompt per class (commonly "a photo of a {class}") |
| 3 | Encode prompts with CLIP text encoder → K text embeddings |
| 4 | At inference: encode input image; pick class with max cosine to image embedding |
| Result | Classifier with no training on target task; competitive with fully-supervised on many benchmarks |
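
The four steps above can be sketched as follows. This is an illustration only: `toy_encode_text` and its lookup table of 2-D vectors are hypothetical stand-ins for a real CLIP text encoder, and the image embedding is passed in directly rather than produced by an image encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, class_names, encode_text,
                       template="a photo of a {}"):
    """Return (best_class, scores) by cosine similarity to per-class prompts."""
    image_emb = l2_normalize(image_emb)
    scores = {}
    for name in class_names:
        # Steps 2-3: build the prompt and encode it.
        text_emb = l2_normalize(encode_text(template.format(name)))
        # Step 4: cosine similarity is a dot product on unit vectors.
        scores[name] = sum(a * b for a, b in zip(image_emb, text_emb))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical text encoder: a lookup of made-up 2-D embeddings.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.5, 0.866],
    "a photo of a dog": [0.8, -0.6],
}


def toy_encode_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]


label, scores = zero_shot_classify([0.6, 0.8], ["cat", "dog"], toy_encode_text)
print(label)  # cat
```

In practice the K text embeddings would be computed once and cached, so inference costs one image encoding plus K dot products.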

| Direction | Setup |
|---|---|
| Text → image (gallery search) | Encode image corpus → vector index; encode text query; nearest-neighbour |
| Image → text (caption search) | Reverse: encode text corpus; query with image; nearest-neighbour |
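
Both retrieval directions reduce to the same operation: normalize the corpus embeddings, normalize the query, and take the highest-cosine neighbours. The sketch below does this by brute force over a toy corpus. `build_index` and `search` are names chosen for this example, and a production system would more likely use an approximate nearest-neighbour index than a linear scan.

```python
import heapq
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(corpus_embs):
    """'Index' here is just the list of unit-normalized corpus embeddings."""
    return [l2_normalize(v) for v in corpus_embs]


def search(index, query_emb, k=2):
    """Return the k best (cosine, corpus_position) pairs, best first."""
    q = l2_normalize(query_emb)
    scored = (
        (sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(index)
    )
    return heapq.nlargest(k, scored)


# Toy image corpus (made-up 2-D embeddings) queried with a toy text embedding.
image_index = build_index([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
hits = search(image_index, [0.5, 0.866], k=2)
print([i for _, i in hits])  # [1, 2]
```

Swapping the roles (a text corpus in the index, an image embedding as the query) gives the image → text direction with no code changes.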

| Component | Detail |
|---|---|
| Image encoder | CLIP-style or self-supervised pre-trained |
| Text decoder | Transformer language model; generates text one token at a time |
| Bridge | Cross-attention from text decoder to image features |
| Training | Image-caption pairs; standard language-model cross-entropy loss |
| Examples | BLIP, CoCa, LLaVA, successors |
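
The "one token at a time" behaviour of the text decoder can be illustrated with a greedy decoding loop. This is a sketch under strong assumptions: `next_token_logits` is a hypothetical callable standing in for the decoder (which in a real model would attend to the image features), and the toy version below ignores the image and follows a fixed script so that the loop is runnable.

```python
def greedy_caption(image_features, next_token_logits, bos, eos, max_len=20):
    """Generate a caption by repeatedly taking the highest-scoring next token."""
    tokens = [bos]
    for _ in range(max_len):
        # The decoder conditions on the image features and the tokens so far.
        logits = next_token_logits(image_features, tokens)
        next_token = max(logits, key=logits.get)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the begin-of-sequence marker


def toy_next_token_logits(image_features, tokens):
    """Hypothetical decoder: ignores the image and emits a fixed caption."""
    script = ["a", "cat", "<eos>"]
    target = script[min(len(tokens) - 1, len(script) - 1)]
    vocab = ["a", "cat", "dog", "<eos>"]
    return {tok: (1.0 if tok == target else 0.0) for tok in vocab}


print(greedy_caption(None, toy_next_token_logits, "<bos>", "<eos>"))
# ['a', 'cat']
```

Training uses the same conditioning but scores the ground-truth caption tokens with cross-entropy instead of sampling from the model.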

| Compared to captioning | Difference |
|---|---|
| Text decoder’s prompt | The question (rather than empty/generic) |
| Output | Answer conditioned on image features AND question |
| Modern trend | Subsumed by general-purpose VLMs handling many tasks from one system |

| Component | Role |
|---|---|
| Pre-trained image encoder | Often ViT, often CLIP-style contrastive-trained |
| Pre-trained language model | Transformer text decoder |
| Bridge module | Adapter or learnable query tokens projecting image features into LM input space |
| Training | Image-text pairs + image-question-answer triples + visual instructions; end-to-end or staged |
| Downstream | Classification, retrieval, captioning, VQA, etc. from one trained system |
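
One simple form of bridge module is a learned linear projection that maps each image patch feature to the language model's embedding size, after which the projected "image tokens" are placed in front of the text token embeddings. The sketch below shows only that data flow. The function names, the weight layout (`d_lm` rows of `d_img` values), and the toy numbers are all assumptions, and learnable-query bridges work differently.

```python
def project(patch_features, weight, bias):
    """Map each patch feature (length d_img) to the LM embedding size (d_lm).

    weight has d_lm rows of d_img values; bias has d_lm values.
    """
    return [
        [sum(w * x for w, x in zip(row, feat)) + b
         for row, b in zip(weight, bias)]
        for feat in patch_features
    ]


def build_lm_input(patch_features, text_token_embs, weight, bias):
    """Prepend projected image tokens to the text token embeddings."""
    return project(patch_features, weight, bias) + text_token_embs


# Toy sizes: 2 image patches with d_img = 3, projected to d_lm = 2.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]
patches = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
text_embs = [[9.0, 9.0]]  # one text token embedding of size d_lm

sequence = build_lm_input(patches, text_embs, weight, bias)
print(len(sequence))  # 3
```

The language model then processes the combined sequence as if the image tokens were ordinary input embeddings.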

| Technique | What it does |
|---|---|
| Default | "a photo of a {class}" baseline |
| Domain-anchored | "a {medical scan / satellite image / sketch} of a {class}" shifts CLIP to the target register |
| Template ensembling | Encode many templates; average text embeddings; reduces variance |
| When NOT to use | When the test domain is genuinely outside CLIP’s training data; fine-tune or use a domain-pre-trained variant |
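
Template ensembling can be sketched as: encode the class name under several templates, normalize each embedding, average them, and normalize the mean again so it can be used as a single class vector. The toy encoder and its 2-D embeddings below are hypothetical, as is the helper name `ensemble_class_embedding`.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_class_embedding(class_name, templates, encode_text):
    """Average unit-normalized prompt embeddings, then renormalize the mean."""
    embs = [l2_normalize(encode_text(t.format(class_name))) for t in templates]
    dim = len(embs[0])
    mean = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
    return l2_normalize(mean)


# Hypothetical text encoder: a lookup of made-up 2-D embeddings.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat": [1.0, 0.0],
    "a sketch of a cat": [0.0, 1.0],
}


def toy_encode_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]


templates = ["a photo of a {}", "a sketch of a {}"]
cat_emb = ensemble_class_embedding("cat", templates, toy_encode_text)
print([round(x, 4) for x in cat_emb])  # [0.7071, 0.7071]
```

The resulting vector replaces the single-template text embedding in zero-shot classification; nothing else in the pipeline changes.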

| Resource | Property |
|---|---|
| Hand-labeled data | Expensive (per-image annotation cost) |
| Unlabeled images | Abundant (self-supervised, L10) |
| Image-text pairs from the web | Abundant (every image with caption or alt text) |

CLIP-scale training was feasible because the data already existed. This is the same shift as in L10: stop paying for hand labels when naturally occurring structure suffices.

| Pitfall | Reality |
|---|---|
| CLIP = classifier with fixed labels | It’s a joint embedding model; classes are determined by what text you embed |
| “Zero-shot” means zero training | Means zero training on target task; CLIP itself was trained on 400M pairs |
| CLIP biases = architecture’s fault | Biases come from training data (web statistics); different data, different biases |
| VLMs “understand” images like humans | Statistical pattern-matching on co-occurrence; impressive downstream but not understanding in a human sense |

CLIP’s two-tower contrastive pre-training produced a joint image-text embedding space. Zero-shot classification, retrieval, captioning, and VQA all draw on that space’s geometry, and modern VLMs generalize the pattern into one trained system that handles many tasks. The economic frame: image-text pairs are abundant on the web, and the field has shifted to exploit that.