Connecting pictures and words, vision and language

This is lesson 14 of the track, in Phase 3 (Generating and grounding vision). It builds one capability: you will be able to explain CLIP’s two-tower contrastive setup, compute image-text cosine similarity, and reason about how zero-shot classification, retrieval, captioning, and VQA all fall out of the same trained joint embedding space. The source curriculum is Stanford CS231n, cs231n.stanford.edu; this lesson maps to Lecture 16 (Vision and Language). Deep VLM mechanics are deferred to sister tracks T24 (planned, vision-language and image generation) and T14 (planned, practical transformers) per the Track 16 Phase 0 arc.

The lesson opens with CLIP’s two-tower setup: an image encoder and a text encoder trained with a contrastive InfoNCE loss on ~400M web image-text pairs. It then shows the structure of the trained joint embedding space with one cosine similarity worked by hand, and walks through the downstream applications: zero-shot classification (encode class names as text, pick the closest), image-text retrieval (text-to-image and image-to-text search), captioning (add a text decoder with cross-attention to image features), and visual question answering (the captioning architecture with a question input). It surveys modern vision-language models (VLMs) at the “image encoder + language model + bridge module” level and closes with the economic frame that ties Phase 3 together: image-text pairs are abundant on the web, and CLIP-scale pre-training exploits that abundance.
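The contrastive objective named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not CLIP’s training code: the function names, the 2-D toy embeddings, and the temperature of 0.07 are all chosen for the example. It treats image i and text i in a batch as the matching pair and every other combination as a negative.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one list of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE sketch: image i should match text i, and vice versa.

    The temperature value is an illustrative assumption.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should peak on the diagonal.
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    loss_t2i = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2

# Hypothetical toy batch of two image-text pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]   # text i points near image i
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]  # text i points near the wrong image

loss_aligned = clip_style_loss(images, aligned_texts)
loss_shuffled = clip_style_loss(images, shuffled_texts)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is small when matching pairs sit close together in cosine space and large when they do not, which is the pressure that shapes the joint embedding space during pre-training.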

This is lesson 14 of 16, the fifth lesson of Phase 3. It depends on lesson 10 (self-supervised vision; the contrastive learning ideas extend across modalities here) and lesson 7 (sequence tools for vision; cross-attention is the bridge module’s primary mechanism in captioning and VLMs). The next lesson, Models that imagine the world: world modeling, covers the frontier where vision models learn to predict future frames or states. Lesson 16 closes the track with the human-centered view.

Prerequisites: lessons 7 and 10 of this track. Lesson 10’s contrastive-learning framing (positive pair near in cosine space, negative pair far) transfers directly to CLIP’s cross-modal setup. Lesson 7’s attention is the mechanism behind both transformer text encoders and the cross-attention bridges in captioning and VLMs.

Math load: light. The body computes one image-text cosine similarity by hand (img=[0.6, 0.8] vs matching txt=[0.5, 0.866] → cos ≈ 0.993; vs unrelated txt=[0.8, -0.6] → cos = 0.000; all three vectors are unit length, to rounding, so cosine reduces to the dot product). Practice repeats the computation with fresh vectors against three candidates (one strong match, one partial, one anti-match) to land the zero-shot-classification intuition. No calculus required.
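The worked numbers above can be checked with a short script. It is a minimal sketch using only the standard library; the function and variable names are chosen for illustration. It computes the full cosine formula rather than assuming unit length, so it also works for vectors that are not normalized.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

img = [0.6, 0.8]
txt_match = [0.5, 0.866]      # the matching caption's embedding
txt_unrelated = [0.8, -0.6]   # orthogonal to the image embedding

cos_match = cosine(img, txt_match)
cos_unrelated = cosine(img, txt_unrelated)

print(f"match:     {cos_match:.3f}")      # about 0.993
print(f"unrelated: {cos_unrelated:.3f}")  # 0.000
```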

  • Describe CLIP’s two-tower contrastive setup
  • Compute image-text cosine similarity
  • Explain zero-shot classification and why prompt engineering matters
  • Distinguish CLIP retrieval, captioning, and VQA architecturally
  • Recognize the Phase 3 economic frame (web image-text pairs as abundant naturally-paired data)
  • Read time: about 14 minutes
  • Practice time: about 15 minutes (a fresh image-text cosine computation against three candidates, an application-matching exercise across CLIP/retrieval/captioning/VQA, a prompt-engineering planning question with the domain-mismatch caveat, plus flashcards)
  • Difficulty: standard (the math is multiplication and addition for cosine; the conceptual lift is seeing how one trained joint embedding space supports so many different downstream tasks)
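
To make the zero-shot classification objective concrete, here is a minimal sketch of the "encode class names as text, pick the closest" step. Everything in it is an assumption for illustration: the 2-D embeddings are made up rather than produced by a real encoder, the dictionary stands in for a text encoder, and the "a photo of a ..." wording is just one example of a prompt template.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, prompt_embs):
    """Score the image against every prompt embedding and return the best label."""
    scores = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical text embeddings; a real system would get these from a text encoder.
prompt_embs = {
    "a photo of a dog": [0.6, 0.8],    # strong match for the image below
    "a photo of a cat": [0.0, 1.0],    # partial match
    "a photo of a car": [-0.8, -0.6],  # anti-match (points the opposite way)
}
image_emb = [0.8, 0.6]  # hypothetical image embedding

best, scores = zero_shot_classify(image_emb, prompt_embs)
for label, score in scores.items():
    print(f"{label}: {score:+.2f}")
print("predicted:", best)
```

No classifier is trained here: changing the set of classes only means changing the prompts, which is why the wording of those prompts affects the result.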