Sequence tools for vision: brief

What you’ll learn

This is lesson 7 of the track, in Phase 2 (How machines see). It builds one capability: you will be able to identify when a vision system needs sequence tools, name the right tool (recurrence or attention), compute one attention head by hand, and find the deep architectural mechanics in the right sister track. The source curriculum is Stanford CS231n (cs231n.stanford.edu). This lesson combines Lecture 7 (RNNs) and Lecture 8 (Attention and Transformers), following the Track 16 Phase 0 arc, and defers the deep mechanics to sister tracks.

The lesson opens with the three places sequences enter computer vision: as output (captioning), as input (video), and internally (the Vision Transformer, ViT). It then walks through recurrence: the RNN update rule, the hidden state, and the two structural limits (computation is sequential, and the model is weak at long range). Next it walks through attention: the Q/K/V formula, and the two gains (parallelism and long-range access) that follow from one architectural change. It covers vision applications: classic CNN-RNN captioning, modern CNN-attention captioning, CNN-RNN video, and the Vision Transformer, which replaces convolution entirely. Finally, it states the Phase 0 rationale for combining the two lectures and routes you to sister tracks (Track 12 L2 for recurrence; Track 5’s multi-lesson sequence and Track 14 for attention and transformers).
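The recurrence summarized above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the common vanilla form h_t = tanh(W_hh h_{t-1} + W_xh x_t + b). The function names (rnn_step, run_rnn), the bias term, and every weight value are made up for illustration and are not taken from the lesson body.

```python
import math

def rnn_step(h_prev, x, W_hh, W_xh, b):
    """One vanilla RNN update: h_t = tanh(W_hh h_prev + W_xh x + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(xs, h0, W_hh, W_xh, b):
    """Apply rnn_step over a sequence; each step needs the previous state."""
    h = h0
    states = []
    for x in xs:  # strictly sequential: step t cannot start before step t-1 ends
        h = rnn_step(h, x, W_hh, W_xh, b)
        states.append(h)
    return states

# Hypothetical toy setup: hidden size 2, input size 1, three time steps.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0], [-1.0]]
b = [0.0, 0.0]
xs = [[1.0], [0.0], [0.0]]  # a single "event" at the first step, then silence

states = run_rnn(xs, [0.0, 0.0], W_hh, W_xh, b)
for t, h in enumerate(states, start=1):
    print(t, [round(v, 4) for v in h])
```

Each state depends on the one before it, so the loop cannot run in parallel across time steps. With these toy weights the trace of the first input also shrinks at every step, which gives a small picture of the two structural limits.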

Where this fits

This is lesson 7 of 16, the third lesson of Phase 2. It depends on lessons 5 and 6 (the conv layer and the CNN architectures it stacks into; CNN-plus-RNN and CNN-plus-attention systems use the conv layers from L5 as encoders). The next lesson, Beyond “what is it”: detection, segmentation, and seeing inside the net, covers richer vision tasks (object detection, semantic segmentation, and feature-map visualization).

Before you start

Prerequisites: lessons 5 and 6 of this track (the conv layer and the architectures it stacks into; these are the encoders that feed the sequence models here). If you want the deep architectural mechanics this lesson defers, Track 12 lesson 2 covers recurrence; Track 5’s multi-lesson sequence and Track 14 cover attention/transformers.

About the math

Light. The body cites the basic RNN update rule (one line) and the standard attention formula (one line), then works one attention head by hand on a 2D toy example (d_k = 2, one query, three keys, three values): raw scores, then scaling, then softmax, then a weighted sum of V. Practice repeats the computation with fresh numbers. No calculus is needed, only multiplication, addition, three exponentials, one softmax, and one weighted sum.
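To check a hand computation of this kind, the same four steps can be run in plain Python. This is a sketch assuming the standard scaled dot-product form, softmax(q K^T / sqrt(d_k)) V, for a single query. The numbers below are hypothetical and are not the ones used in the lesson body or the practice set; the helper names (softmax, attend) are also illustrative.

```python
import math

def softmax(xs):
    """Softmax over a list; subtracting the max does not change the result."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """One attention head for one query: scores, scale, softmax, weighted sum."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]  # q . k_i
    scaled = [s / math.sqrt(d_k) for s in scores]                  # divide by sqrt(d_k)
    weights = softmax(scaled)                                      # three exponentials
    dim_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return scores, weights, out

# Hypothetical toy inputs: d_k = 2, one query, three keys, three values.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

scores, weights, output = attend(q, keys, values)
print("scores :", scores)
print("weights:", [round(w, 3) for w in weights])
print("output :", [round(o, 3) for o in output])
```

With these inputs the first and third keys tie for the highest score, so they receive equal weight (roughly 0.40 each) and the middle key gets the remainder (roughly 0.20). Because the weights sum to 1, the output is a weighted average of the value vectors, which is a useful sanity check on any hand calculation.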

By the end, you’ll be able to

Name the three places sequences enter CV with a task example each
Write the RNN update rule and name the two structural limits
Write the attention formula and compute one head by hand
Describe the ViT architecture and name what it replaces
Locate the deep mechanics in the right sister track
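For the ViT objective above, it may help to see how an image becomes a sequence in the first place. The sketch below assumes the usual ViT front end, in which the image is cut into fixed-size square patches and each patch is flattened into one token. It uses a single-channel toy grid and a hypothetical helper name (patchify), and it stops before the linear projection and position embeddings that a real ViT applies next.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tokens.

    Tokens are returned in row-major patch order (left to right, top to bottom).
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens

# Hypothetical 4 x 4 single-channel "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)

print(len(tokens))  # number of tokens in the sequence
print(tokens[0])    # the top-left patch, flattened
```

The resulting list of tokens is what attention operates on, in place of the feature maps a conv stack would produce. The sequence length grows with image area: a 224 x 224 image cut into 16 x 16 patches gives (224 / 16)^2 = 196 tokens.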

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (a fresh attention computation, an architecture-matching exercise, a reasoning question about attention maps, plus flashcards)
Difficulty: standard (the math is multiplication, exponentials, and one softmax; the conceptual step is seeing that both gains, parallelism and long-range access, come from one architectural change)