References: Vision and language

Source material

This lesson follows Stanford CS231n’s treatment of vision and language (Lecture 16).

Course: Stanford CS231n, “Deep Learning for Computer Vision”
Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
Course site: cs231n.stanford.edu
This lesson maps to: Lecture 16 (Vision and Language).

Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.

A note on access and license

The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.

Primary papers (cited by name and venue)

CLIP and contrastive vision-language pretraining

CLIP. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision” (ICML 2021). The two-tower contrastive setup that defined the field; ~400M web (image, text) pairs.
ALIGN. Jia et al., “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision” (ICML 2021). Contemporaneous Google work at even larger scale (over 1B noisy image and alt-text pairs).
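SigLIP. Zhai, Mustafa, Kolesnikov, Beyer, “Sigmoid Loss for Language Image Pre-Training” (ICCV 2023). A modern follow-on that replaces the softmax-contrastive loss with a pairwise sigmoid loss; each (image, text) pair is scored independently, so the loss needs no batch-wide normalization and behaves differently as batch size changes.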

Captioning architectures

Show, Attend and Tell. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” (ICML 2015). The historical baseline for attention-based captioning (also cited in lesson 7).
BLIP / BLIP-2. Li, Li, Xiong, Hoi, “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation” (ICML 2022); Li et al., “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models” (ICML 2023). Influential captioning and VLM architectures.
CoCa. Yu et al., “CoCa: Contrastive Captioners are Image-Text Foundation Models” (TMLR 2022). A unified contrastive + captioning system.

Visual question answering

VQA dataset and task. Antol et al., “VQA: Visual Question Answering” (ICCV 2015). The dataset and task formulation that launched modern VQA work.

Modern vision-language models

LLaVA. Liu, Li, Wu, Lee, “Visual Instruction Tuning” (NeurIPS 2023). Open-source visual-instruction-tuned VLM; influential reference architecture.
Flamingo. Alayrac et al., “Flamingo: a Visual Language Model for Few-Shot Learning” (NeurIPS 2022). Few-shot VLM that demonstrated the in-context-learning paradigm for vision-language tasks.
PaLI. Chen et al., “PaLI: A Jointly-Scaled Multilingual Language-Image Model” (ICLR 2023). Large-scale multilingual vision-language model.

Domain-specific CLIP variants (for context)

BiomedCLIP. Zhang et al., “BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs” (arXiv 2023). Domain-adapted CLIP for biomedical imagery.
RemoteCLIP. Liu et al., “RemoteCLIP: A Vision Language Foundation Model for Remote Sensing” (IEEE TGRS 2024). Domain-adapted CLIP for satellite/aerial imagery.

Further study (deeper mechanics in sister tracks)

T24 (planned, image generation and multimodal). Will cover modern vision-language models end-to-end (LLaVA, BLIP-2, Flamingo-style architectures), including the bridge module designs (Q-Former, learnable query tokens, adapter networks), staged training schedules, and instruction tuning for vision tasks. The right destination if you want to actually train or fine-tune a VLM.
T14 (planned, practical transformers). Will cover the transformer attention mechanism in depth (multi-head attention, cross-attention, positional encoding, scaling); the attention used in CLIP’s text tower and in VLM cross-attention bridges is the same machinery covered there.
T5 (AI Foundations). Has a multi-lesson sequence on attention and transformer blocks that is the canonical Clawdemy reference for attention mechanics; it is cross-linked from lesson 7 of this track.

Further study (tools and reproduction)

OpenCLIP (github.com/mlfoundations/open_clip): open-source reproduction of CLIP with many model variants (different scales, training data, encoders); the de-facto practitioner reference.
Hugging Face transformers + CLIP: a widely used interface for loading and running CLIP-style models.
LLaVA open-source release and reproductions: a common entry point into the open VLM ecosystem.

How we use this source

Clawdemy follows CS231n’s Lecture 16 ordering (CLIP setup, zero-shot, retrieval, captioning, VQA, modern VLMs) and cites the canonical papers by name and venue. The cosine-similarity worked examples (body: img=[0.6, 0.8] vs txt_cat=[0.5, 0.866] → cos ≈ 0.993; vs txt_car=[0.8, -0.6] → cos = 0.000; practice: img=[0.8, 0.6] against three candidates) are Clawdemy-authored and computed with the standard cosine formula. The prompt-engineering exercise and the “domain mismatch → fine-tune or use domain variant” framing in practice are Clawdemy-authored, summarizing well-known practitioner consensus. We do not name specific commercial vision-language products by brand; this keeps Clawdemy’s vendor-neutral framing, per the curriculum’s worked-environment rule. We do not reproduce CS231n’s slides, figures, problem sets, or lecture text. Full attribution policy: see Doc/attribution-policy.md.
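
For readers who want to check the body examples quoted above, here is a minimal sketch in plain Python. The helper name cosine_similarity is ours and purely illustrative; it is the standard cosine formula (dot product divided by the product of the vector norms), not code taken from CS231n or from any CLIP implementation. The three candidate vectors from the practice exercise are not listed on this page, so only the two body examples are reproduced.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D embeddings from the lesson body (not real CLIP outputs).
img = [0.6, 0.8]
txt_cat = [0.5, 0.866]
txt_car = [0.8, -0.6]

sim_cat = cosine_similarity(img, txt_cat)
sim_car = cosine_similarity(img, txt_car)

print(f"cat: {sim_cat:.3f}")  # cat: 0.993
print(f"car: {sim_car:.3f}")  # car: 0.000
```

The higher score for txt_cat is what a zero-shot classifier would act on: it picks the text prompt whose embedding is closest in angle to the image embedding.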