Summary: Sequence tools for vision

A lot of vision tasks are sequence-flavoured: captions are sequences of words, videos are sequences of frames, and the Vision Transformer (ViT) treats an image itself as a sequence of patches. The two sequence-processing tools the field uses are recurrence (RNNs and their cousins; one step at a time, hidden state carried forward) and attention (compare every position to every other in parallel, weight by compatibility). This lesson covers both at the level needed for vision applications and routes to sister tracks for the deep architectural mechanics. The unifying point is that when the data or the answer is a sequence, the vision system needs a way to relate positions across it, and these are the two ways.

Core ideas

Three places sequences enter vision. Output is a sequence (image captioning produces words); input is a sequence (video understanding processes frames); the image itself can be a sequence (ViT splits it into a grid of patches, treats them as tokens).
Recurrence (RNN). h_t = f(W · x_t + U · h_{t-1} + b), where f is a nonlinearity such as tanh. Read one element at a time and carry a hidden state forward; the output at each step is read off the hidden state. Vision uses: CNN-plus-RNN image captioning (the CNN encodes; the RNN decodes word by word) and CNN-plus-RNN video understanding (per-frame CNN features feed an RNN over time). Limits: sequential (step t cannot start until step t-1 finishes, so no parallelism across the sequence) and weak at long-range dependencies (LSTM/GRU gating helps but does not fully solve it).
Attention. Attention(Q, K, V) = softmax( Q · K^T / sqrt(d_k) ) · V, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. Pipeline: compatibility scores → scaled → softmax → weighted average of values. Two lifts vs RNNs: parallelism (no sequential dependency) and long-range reach (every output position attends directly to every input position). Both lifts come from the same change: replace “step through with a hidden state” with “weighted average over all positions, weights from compatibility.”
Vision Transformer (ViT). Replaces convolution with attention. Cut the image into patches of 16x16 pixels, embed each patch as a vector, treat the result as a sequence (with position embeddings added), run a transformer encoder, read off the class-token representation, and put a linear classifier on top. Matches or beats ResNet when trained at scale, and unifies vision and language architecturally (the same transformer block handles both).
Sister-track routing. This lesson stays at the applied-to-vision level by design. Recurrence in depth: Track 12 lesson 2 (Why sequences need memory). Attention and the full transformer: Track 5’s multi-lesson attention sequence; the upcoming Track 14 (practical transformers) covers it end to end. The training loop is unchanged regardless of architecture: Track 16 lesson 3 (loss) and lesson 4 (backprop).
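
The recurrence update from the core ideas, h_t = f(W · x_t + U · h_{t-1} + b), can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: f is assumed to be tanh, the weights are random, and the sizes (5 steps, 8 input features, 4 hidden units) are made up to stand in for a short clip of per-frame CNN features.

```python
import numpy as np

def rnn_forward(xs, W, U, b, h0=None):
    """Run a vanilla RNN over a sequence, one step at a time.

    xs: (T, d_in) inputs; W: (d_h, d_in); U: (d_h, d_h); b: (d_h,).
    Returns the (T, d_h) array of hidden states.
    """
    d_h = U.shape[0]
    h = np.zeros(d_h) if h0 is None else h0
    hs = []
    for x_t in xs:
        # h_t = f(W x_t + U h_{t-1} + b) with f = tanh (an assumption).
        # Step t needs h from step t-1, so this loop cannot run in parallel.
        h = np.tanh(W @ x_t + U @ h + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 8, 4                      # toy sizes, chosen for illustration
xs = rng.normal(size=(T, d_in))             # stand-in for per-frame features
W = rng.normal(size=(d_h, d_in)) * 0.1      # random stand-ins for learned weights
U = rng.normal(size=(d_h, d_h)) * 0.1
b = np.zeros(d_h)

hs = rnn_forward(xs, W, U, b)
print(hs.shape)  # (5, 4)
```

The last row of hs is the state after the whole sequence has been read; a captioning decoder or video classifier would read its output from these states.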
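
The attention formula from the core ideas, softmax( Q · K^T / sqrt(d_k) ) · V, maps almost directly onto array operations. The sketch below is a single-head version under simplifying assumptions: there are no learned projections (Q, K, and V are all the same random matrix X), and the sizes are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    Returns the (n_q, d_v) outputs and the (n_q, n_k) attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compatibility scores, scaled
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of values

rng = np.random.default_rng(0)
n, d = 6, 16                             # toy sizes: 6 positions, 16-dim vectors
X = rng.normal(size=(n, d))

# Self-attention with Q = K = V = X (a simplification; real models
# apply learned linear projections to X first).
out, weights = attention(X, X, X)
print(out.shape, weights.shape)  # (6, 16) (6, 6)
```

There is no loop over positions: all rows of the score matrix are computed in one matrix product, and every row has a weight for every position, which is where the parallelism and the long-range reach both come from.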
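
The front end of the ViT recipe above (patches, embedding, class token, position embeddings) can be sketched as follows, reading the 16x16 as the pixel size of each patch. The 224x224 input size, the embedding width of 64, and the random matrices standing in for learned parameters are all assumptions made for illustration; the transformer encoder and classifier are left out.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a (num_patches, patch*patch*C) sequence."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must divide into patches"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))       # assumed input size
d_model = 64                                 # assumed embedding width

patches = patchify(image)                    # (196, 768): a 14x14 grid of patches
E = rng.normal(size=(patches.shape[1], d_model)) * 0.02        # stand-in for the learned patch embedding
cls = np.zeros((1, d_model))                                   # stand-in for the learned class token
pos = rng.normal(size=(patches.shape[0] + 1, d_model)) * 0.02  # stand-in for learned position embeddings

# Prepend the class token, then add one position embedding per token.
tokens = np.concatenate([cls, patches @ E], axis=0) + pos
print(patches.shape, tokens.shape)  # (196, 768) (197, 64)
```

From here on the image is just a (197, 64) sequence of tokens, which is exactly the shape of input a transformer encoder expects.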

What changes for you

Three things you read about modern AI are this lesson in disguise. When a captioning model “looks at” different parts of an image as it writes a caption, you are seeing attention over image regions (lighting up the cat when generating “cat”). When a video model “understands” that someone is catching a ball rather than throwing it, the temporal sequence model on top of per-frame features is what enables that. When a paper or product announcement says “ViT-Large,” “Swin Transformer,” or “DINOv2,” the underlying architecture is the vision-transformer pattern or a variant of it. The architectural unification ViT enabled also explains a lot of multimodal-model headlines: a system that “understands images and text together” usually does so by encoding each modality into the same kind of token stream (patches for images, sub-words for text), running a transformer over the combined stream, and reading off the answer. The same building block does both jobs; before attention, images and text typically needed different building blocks (CNNs for one, RNNs for the other).
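
The “combined token stream” idea can be made concrete with a toy sketch. Everything here is an assumption for illustration: the embeddings are random rather than produced by real patch or sub-word encoders, the counts (9 patches, 4 sub-words) are arbitrary, and a single projection-free attention step stands in for a full transformer.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with Q = K = V = X (no learned projections)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
d_model = 32
image_tokens = rng.normal(size=(9, d_model))  # e.g. a 3x3 grid of patch embeddings
text_tokens = rng.normal(size=(4, d_model))   # e.g. 4 sub-word embeddings

# Once both modalities are (n, d_model) arrays, they concatenate into one sequence.
stream = np.concatenate([image_tokens, text_tokens], axis=0)  # (13, d_model)
out, weights = self_attention(stream)

# Rows for text positions, columns for image positions:
# how much each word attends to each patch.
text_to_image = weights[9:, :9]
print(out.shape, text_to_image.shape)  # (13, 32) (4, 9)
```

The attention step never checks which modality a token came from; the text_to_image slice is the kind of weight map behind the “lighting up the cat” visualizations.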

Sequences enter vision in three places (output, input, internal); recurrence handles them one at a time and attention handles them in parallel; the Vision Transformer’s bet that attention alone can replace convolution has paid off at scale.