Vision and language: cheatsheet

CLIP two-tower setup

Element	Detail
Image encoder	ViT or ResNet → embedding vector
Text encoder	Transformer → embedding vector (same dim)
Projection	Both projected to shared embedding space, L2-normalized to unit vectors
Training data	~400 million web (image, text) pairs
Loss	Symmetric InfoNCE (contrastive): matching pairs close in cosine, mismatched far
Output	Geometry of joint embedding space encodes cross-modal semantic structure
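
A minimal sketch of the symmetric contrastive loss named in the table, in plain Python. The function names, the toy 2-D embeddings, and the fixed temperature of 0.07 are assumptions for illustration; a real implementation would run on batched tensors with a learned temperature.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image->text and text->image cross-entropy over one batch.

    Pair i in image_embs is assumed to match pair i in text_embs.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Rows are images, columns are texts; entries are scaled cosines.
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    logits_t = [list(col) for col in zip(*logits)]  # text -> image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch of two pairs: correctly aligned vs deliberately swapped.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

The loss is near zero when each image is closest to its own text and large when the pairing is swapped, which is the pressure that shapes the joint space.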

Worked image-text cosine (body)

Both vectors unit length, so cosine = dot product:

Pair	Vectors	Cosine
Image vs matching text	`[0.6, 0.8]` vs `[0.5, 0.866]`	≈ 0.993 (high)
Image vs unrelated text	`[0.6, 0.8]` vs `[0.8, -0.6]`	0.000 (orthogonal)

Because both encoders project into one shared space, an image embedding and a text embedding can be compared directly by cosine.
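
The two cosines in the table can be checked with a few lines of Python. This is only a sketch; the helper name `cosine` is an assumption, and the vectors are the ones from the table.

```python
import math

def cosine(a, b):
    """Cosine similarity; equals the dot product when both inputs are unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

image = [0.6, 0.8]
matching_text = [0.5, 0.866]
unrelated_text = [0.8, -0.6]

print(round(cosine(image, matching_text), 3))   # 0.993
print(round(cosine(image, unrelated_text), 3))  # 0.0
```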

Zero-shot classification (recipe)

Step	Action
1	Pick K candidate classes
2	Build text prompt per class (commonly `"a photo of a {class}"`)
3	Encode prompts with CLIP text encoder → K text embeddings
4	At inference: encode input image; pick class with max cosine to image embedding
Result	Classifier with no training on the target task; competitive with fully supervised baselines on many benchmarks
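
The recipe can be sketched as below. The text and image embeddings are hypothetical 2-D stand-ins for encoder outputs (no real CLIP model is loaded), and the names `build_prompts` and `zero_shot_classify` are assumptions for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_prompts(class_names, template="a photo of a {}"):
    """Step 2: one text prompt per candidate class."""
    return [template.format(name) for name in class_names]

def zero_shot_classify(image_emb, class_names, text_embs):
    """Step 4: pick the class whose text embedding has max cosine to the image."""
    scores = [cosine(image_emb, t) for t in text_embs]
    best = max(range(len(class_names)), key=lambda i: scores[i])
    return class_names[best], scores

class_names = ["dog", "cat", "car"]          # step 1
prompts = build_prompts(class_names)         # step 2
# Step 3 would encode `prompts` with the text encoder; these are made-up values.
toy_text_embs = [[0.6, 0.8], [0.8, -0.6], [-1.0, 0.0]]
toy_image_emb = [0.5, 0.866]                 # made-up image encoder output

label, scores = zero_shot_classify(toy_image_emb, class_names, toy_text_embs)
print(prompts[0])  # a photo of a dog
print(label)       # dog
```

The text embeddings are computed once per class list and reused for every image, so changing the label set only means re-encoding prompts.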

Image-text retrieval (both directions)

Direction	Setup
Text → image (gallery search)	Encode image corpus → vector index; encode text query; nearest-neighbour
Image → text (caption search)	Reverse: encode text corpus; query with image; nearest-neighbour
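
Both directions reduce to the same nearest-neighbour search over unit vectors. A brute-force sketch follows; the file names, captions, and embeddings are invented for illustration, and a real system would likely use an approximate vector index instead of sorting the whole corpus.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query_emb, corpus_embs, k):
    """Indices of the k corpus items with highest dot product to the query.

    Assumes all embeddings are already L2-normalized, so dot product = cosine.
    """
    ranked = sorted(range(len(corpus_embs)),
                    key=lambda i: dot(query_emb, corpus_embs[i]),
                    reverse=True)
    return ranked[:k]

# Text -> image: made-up gallery embeddings, made-up text query embedding.
gallery_names = ["beach.jpg", "forest.jpg", "city.jpg"]
gallery_embs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
text_query_emb = [0.1, 0.995]
image_hits = [gallery_names[i] for i in top_k(text_query_emb, gallery_embs, k=2)]
print(image_hits)  # ['forest.jpg', 'city.jpg']

# Image -> text: same function, with the roles of the corpora swapped.
captions = ["a sandy beach", "a pine forest"]
caption_embs = [[1.0, 0.0], [0.0, 1.0]]
image_query_emb = [0.6, 0.8]
caption_hits = [captions[i] for i in top_k(image_query_emb, caption_embs, k=1)]
print(caption_hits)  # ['a pine forest']
```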

Captioning architecture

Component	Detail
Image encoder	CLIP-style or self-supervised pre-trained
Text decoder	Transformer language model; generates text one token at a time
Bridge	Cross-attention from text decoder to image features (BLIP, CoCa), or image features projected into the decoder’s input sequence (LLaVA)
Training	Image-caption pairs; standard language-model cross-entropy loss
Examples	BLIP, CoCa, LLaVA, successors
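
A toy sketch of the cross-attention bridge: each text position queries the image patch features and receives a weighted mix of them. It is single-head and omits the learned query/key/value projections, which is a simplification; all names and values here are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, image_feats):
    """Scaled dot-product attention: queries from text, keys/values from image."""
    d = len(image_feats[0])
    outputs, all_weights = [], []
    for q in text_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in image_feats]
        weights = softmax(scores)
        # Context vector: attention-weighted average of the image features.
        context = [sum(w * v[j] for w, v in zip(weights, image_feats)) for j in range(d)]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

toy_text_states = [[10.0, 0.0], [0.0, 10.0]]   # two decoder positions
toy_image_feats = [[1.0, 0.0], [0.0, 1.0]]     # two image patches
contexts, attn = cross_attention(toy_text_states, toy_image_feats)
print([round(w, 3) for w in attn[0]])  # first position attends mostly to patch 0
```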

VQA: captioning + a question

Compared to captioning	Difference
Text decoder’s prompt	The question (rather than empty/generic)
Output	Answer conditioned on image features AND question
Modern trend	Subsumed by general-purpose VLMs handling many tasks from one system

Modern VLM shape

Component	Role
Pre-trained image encoder	Often ViT, often CLIP-style contrastive-trained
Pre-trained language model	Transformer text decoder
Bridge module	Adapter or learnable query tokens projecting image features into LM input space
Training	Image-text pairs + image-question-answer triples + visual instructions; end-to-end or staged
Downstream	Classification, retrieval, captioning, VQA, etc. from one trained system
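
One common form of the bridge module is a learned linear projection that maps image patch features into the language model's embedding dimension, after which the projected "image tokens" are prepended to the text token embeddings. The sketch below assumes that adapter style; the weights, dimensions, and function names are invented for illustration.

```python
def project(features, weight, bias):
    """Linear map from image feature dim to LM embedding dim.

    features: list of patch vectors, each of length d_img
    weight:   d_img x d_lm matrix (list of rows); bias: length d_lm
    """
    d_lm = len(bias)
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) + bias[j] for j in range(d_lm)]
        for f in features
    ]

def build_lm_input(image_feats, text_token_embs, weight, bias):
    """Prepend projected image tokens to the text token embeddings."""
    return project(image_feats, weight, bias) + text_token_embs

# Toy shapes: 2 patches with d_img = 3, LM embedding dim d_lm = 2, 3 text tokens.
toy_patches = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
toy_weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_bias = [0.0, 0.0]
toy_text_embs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

lm_input = build_lm_input(toy_patches, toy_text_embs, toy_weight, toy_bias)
print(len(lm_input))  # 5
print(lm_input[0])    # [2.0, 1.0]
```

In staged training, a projection like this is often the first part to be trained while the image encoder and language model stay frozen.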

Prompt engineering for CLIP zero-shot

Technique	What it does
Default	`"a photo of a {class}"` baseline
Domain-anchored	`"a {medical scan / satellite image / sketch} of a {class}"` shifts CLIP to the target register
Template ensembling	Encode many templates; average text embeddings; reduces variance
When NOT to use	When the test domain is genuinely outside CLIP’s training data; fine-tune or use a domain-pre-trained variant
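
Template ensembling can be sketched as: normalize each template's text embedding, average them, and re-normalize to get one class embedding. The per-template embeddings below are made-up stand-ins for text encoder outputs, and the function names are assumptions.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ensemble_text_embedding(template_embs):
    """Average unit-normalized template embeddings, then re-normalize."""
    unit = [l2_normalize(v) for v in template_embs]
    dim = len(unit[0])
    mean = [sum(v[j] for v in unit) / len(unit) for j in range(dim)]
    return l2_normalize(mean)

# Hypothetical embeddings of one class under three templates, e.g.
# "a photo of a dog", "a sketch of a dog", "a close-up photo of a dog".
toy_template_embs = [[0.6, 0.8], [0.8, 0.6], [2.0, 2.0]]
class_emb = ensemble_text_embedding(toy_template_embs)
print([round(x, 4) for x in class_emb])  # [0.7071, 0.7071]
```

The ensembled vector replaces the single-template text embedding in zero-shot classification; inference cost per image is unchanged because the averaging happens once per class.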

Economic frame (closes Phase 3)

Resource	Property
Hand-labeled data	Expensive (per-image annotation cost)
Unlabeled images	Abundant (self-supervised, L10)
Image-text pairs from the web	Abundant (every image with caption or alt text)

CLIP-scale training was feasible because the data already existed. Same shift as L10: stop paying for hand-labels when naturally occurring structure suffices.

Pitfalls

Pitfall	Reality
CLIP = classifier with fixed labels	It’s a joint embedding model; classes are determined by what text you embed
“Zero-shot” means zero training	Means zero training on target task; CLIP itself was trained on 400M pairs
CLIP biases = architecture’s fault	Biases come from training data (web statistics); different data, different biases
VLMs “understand” images like humans	Statistical pattern-matching on co-occurrence; impressive downstream but not understanding in a human sense

One-line takeaway

CLIP’s two-tower contrastive pre-training produces a joint image-text embedding space; zero-shot classification and retrieval read directly off that space’s geometry, while captioning and VQA build on the same kind of pre-trained encoder; modern VLMs generalize the pattern into one trained system handling many tasks; the economic frame is that image-text pairs are abundant on the web and the field has shifted to exploit that.