| Element | Detail |
|---|---|
| Image encoder | ViT or ResNet → embedding vector |
| Text encoder | Transformer → embedding vector (same dim) |
| Projection | Both projected to shared embedding space, L2-normalized to unit vectors |
| Training data | ~400 million web (image, text) pairs |
| Loss | Symmetric InfoNCE (contrastive): matching pairs close in cosine, mismatched far |
| Output | Geometry of joint embedding space encodes cross-modal semantic structure |
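
The symmetric InfoNCE loss in the table can be sketched in plain Python as the average of an image-to-text and a text-to-image cross-entropy over a batch. This is a minimal illustration, not the CLIP implementation: the function names, the fixed `temperature` value, and the toy vectors are assumptions chosen for the example.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Average of image->text and text->image cross-entropy over a batch.

    image_emb[i] and text_emb[i] are assumed to be a matching pair and
    already L2-normalized, so the dot product is the cosine similarity.
    The temperature is an assumed fixed value for illustration.
    """
    n = len(image_emb)
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in text_emb] for img in image_emb]
    # Image -> text: each row should concentrate on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: the same computation on the transposed logits.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of two unit vectors per modality (illustrative values only).
images = [[0.6, 0.8], [0.8, -0.6]]
aligned_texts = [[0.6, 0.8], [0.8, -0.6]]   # text i matches image i
swapped_texts = [[0.8, -0.6], [0.6, 0.8]]   # text i matches the other image

loss_aligned = symmetric_info_nce(images, aligned_texts)
loss_swapped = symmetric_info_nce(images, swapped_texts)
```

With these toy vectors, the aligned batch yields a loss near zero and the swapped batch a much larger one, which is the behaviour the "matching pairs close, mismatched far" row describes.
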

Both vectors are unit length, so cosine similarity equals the dot product:

| Pair | Vectors | Cosine |
|---|---|---|
| Image vs matching text | [0.6, 0.8] vs [0.5, 0.866] | ≈ 0.993 (high) |
| Image vs unrelated text | [0.6, 0.8] vs [0.8, -0.6] | 0.000 (orthogonal) |

Because image and text embeddings live in the same shared space, they can be compared directly with cosine similarity.
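
The two cosine values in the table can be checked with a few lines of Python. The helper name `cosine` is an assumption for this sketch. It divides by the norms rather than relying on unit length, since [0.5, 0.866] is only approximately a unit vector.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

image = [0.6, 0.8]
matching_text = [0.5, 0.866]
unrelated_text = [0.8, -0.6]

sim_match = cosine(image, matching_text)       # close to 1: nearly parallel
sim_unrelated = cosine(image, unrelated_text)  # 0: orthogonal vectors
```
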

| Step | Action |
|---|---|
| 1 | Pick K candidate classes |
| 2 | Build text prompt per class (commonly "a photo of a {class}") |
| 3 | Encode prompts with CLIP text encoder → K text embeddings |
| 4 | At inference: encode input image; pick class with max cosine to image embedding |
| Result | Classifier with no training on the target task; competitive with fully supervised baselines on many benchmarks |
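
The steps above can be sketched as follows. The embeddings are hypothetical hand-picked vectors standing in for real encoder outputs. In practice they would come from the CLIP text and image encoders, which this sketch does not call.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def zero_shot_classify(image_embedding, class_embeddings):
    """Return the class whose text embedding has the highest cosine with the image."""
    image_embedding = normalize(image_embedding)
    scores = {name: dot(image_embedding, normalize(vec))
              for name, vec in class_embeddings.items()}
    return max(scores, key=scores.get), scores

# Steps 1-3: hypothetical text embeddings for prompts "a photo of a {class}".
class_embeddings = {
    "cat": [0.9, 0.1, 0.1],
    "dog": [0.1, 0.9, 0.1],
    "car": [0.1, 0.1, 0.9],
}
# Step 4: hypothetical image embedding, then pick the max-cosine class.
image_embedding = [0.2, 0.8, 0.3]

predicted, scores = zero_shot_classify(image_embedding, class_embeddings)
```

Changing the label set only requires embedding new prompts, and no weights are updated, which is why no training on the target task is needed.
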

| Direction | Setup |
|---|---|
| Text → image (gallery search) | Encode image corpus → vector index; encode text query; nearest-neighbour |
| Image → text (caption search) | Reverse: encode text corpus; query with image; nearest-neighbour |
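
A minimal sketch of text-to-image retrieval, assuming the gallery embeddings have already been computed and are unit length. The file names, vectors, and the brute-force `top_k` helper are hypothetical. A production system would typically use an approximate nearest-neighbour index instead of sorting every entry.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k(query, index, k=2):
    """Brute-force nearest-neighbour search by cosine over unit vectors."""
    ranked = sorted(index.items(),
                    key=lambda item: dot(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical unit-length image embeddings, precomputed offline.
image_index = {
    "beach.jpg": [1.0, 0.0],
    "forest.jpg": [0.0, 1.0],
    "dunes.jpg": [0.8, 0.6],
}
# Hypothetical unit-length embedding of a text query.
text_query = [0.96, 0.28]

results = top_k(text_query, image_index, k=2)
```

The image-to-text direction reuses the same function: index the text embeddings and pass an image embedding as the query.
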

| Component | Detail |
|---|---|
| Image encoder | CLIP-style or self-supervised pre-trained |
| Text decoder | Transformer language model; generates text one token at a time |
| Bridge | Cross-attention from text decoder to image features |
| Training | Image-caption pairs; standard language-model cross-entropy loss |
| Examples | BLIP, CoCa, LLaVA, and successors |

| Compared to captioning | Difference |
|---|---|
| Text decoder’s prompt | The question (rather than empty/generic) |
| Output | Answer conditioned on image features AND question |
| Modern trend | Subsumed by general-purpose VLMs handling many tasks from one system |

| Component | Role |
|---|---|
| Pre-trained image encoder | Often ViT, often CLIP-style contrastive-trained |
| Pre-trained language model | Transformer text decoder |
| Bridge module | Adapter or learnable query tokens projecting image features into LM input space |
| Training | Image-text pairs + image-question-answer triples + visual instructions; end-to-end or staged |
| Downstream | Classification, retrieval, captioning, VQA, etc. from one trained system |
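
One simple form of the bridge module is a linear adapter that maps each image feature vector into the language model's embedding dimension and prepends the results to the text token embeddings. The sketch below assumes that form. The sizes, weights, and names (`project`, `visual_tokens`, `lm_input`) are hypothetical, and learnable query tokens would need a different mechanism.

```python
def project(features, weight, bias):
    """Apply a linear map to each feature vector: out = W x + b."""
    return [[sum(w * x for w, x in zip(row, feat)) + b
             for row, b in zip(weight, bias)]
            for feat in features]

# Hypothetical sizes: 2 image patches of dimension 3, LM embedding dimension 2.
patch_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
weight = [[0.5, 0.0, 0.5], [0.0, 1.0, -1.0]]  # shape (lm_dim, image_dim)
bias = [0.0, 0.1]

# Image features projected into the LM input space.
visual_tokens = project(patch_features, weight, bias)

# Hypothetical embeddings of the text prompt tokens (dimension 2).
text_tokens = [[0.3, 0.7], [0.9, 0.1], [0.2, 0.2]]

# The language model would consume visual tokens followed by text tokens.
lm_input = visual_tokens + text_tokens
```
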

| Technique | What it does |
|---|---|
| Default | "a photo of a {class}" baseline |
| Domain-anchored | "a {medical scan / satellite image / sketch} of a {class}" shifts CLIP to the target register |
| Template ensembling | Encode many templates; average text embeddings; reduces variance |
| When NOT to use | When the test domain is genuinely outside CLIP’s training data; fine-tune or use a domain-pre-trained variant |
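
Template ensembling can be sketched as averaging the unit-length text embeddings of one class across several templates and renormalizing the mean. The vectors below are hypothetical stand-ins for text-encoder outputs, and the helper names are assumptions for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ensemble_embedding(template_embeddings):
    """Average per-template unit vectors, then renormalize to unit length."""
    unit = [normalize(v) for v in template_embeddings]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

# Hypothetical embeddings of the class "dog" under three templates, e.g.
# "a photo of a dog", "a sketch of a dog", "a close-up photo of a dog".
dog_embeddings = [[0.6, 0.8], [0.8, 0.6], [0.7071, 0.7071]]

dog_prototype = ensemble_embedding(dog_embeddings)
prototype_norm = math.sqrt(sum(x * x for x in dog_prototype))
```

The resulting prototype replaces the single-template text embedding in zero-shot classification. Renormalizing keeps the cosine comparison with image embeddings consistent.
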

| Resource | Property |
|---|---|
| Hand-labeled data | Expensive (per-image annotation cost) |
| Unlabeled images | Abundant (self-supervised, L10) |
| Image-text pairs from the web | Abundant (every image with caption or alt text) |

CLIP-scale training was feasible because the data already existed. Same shift as L10: stop paying for hand labels when naturally occurring structure suffices.

| Pitfall | Reality |
|---|---|
| CLIP = classifier with fixed labels | It’s a joint embedding model; classes are determined by what text you embed |
| “Zero-shot” means zero training | Means zero training on the target task; CLIP itself was trained on ~400M pairs |
| CLIP biases = architecture’s fault | Biases come from training data (web statistics); different data, different biases |
| VLMs “understand” images like humans | Statistical pattern-matching on co-occurrence; impressive downstream but not understanding in a human sense |

CLIP’s two-tower contrastive pre-training produces a joint image-text embedding space. Zero-shot classification, retrieval, captioning, and VQA all draw on that space’s geometry. Modern VLMs generalize the pattern into one trained system that handles many tasks. The economic frame: image-text pairs are abundant on the web, and the field has shifted to exploit that.