Sequence tools for vision: cheatsheet

Three places sequences enter vision

Where	Example task	Tool
Output	Image captioning (sequence of words)	RNN or attention decoder
Input	Video understanding (sequence of frames)	RNN, temporal CNN, or transformer over time
Internal	Vision Transformer (image as sequence of patches)	Transformer encoder

Recurrence (RNN)

Element	Detail
Update rule	`h_t = f(W · x_t + U · h_{t-1} + b)`, where `f` is a non-linearity such as tanh
Hidden state	Network’s running memory, carried forward step by step
Output per step	Read off `h_t`, optionally via another linear layer
Variants	LSTM, GRU (add gates to control what’s remembered/forgotten)
Limits	Sequential across time steps (step t needs `h_{t-1}`, so no parallelism over the sequence); weak at long-range dependencies
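
The update rule in the table can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes `f = tanh`, a zero initial hidden state, and made-up sizes; the function name `rnn_forward` is hypothetical.

```python
import numpy as np

def rnn_forward(xs, W, U, b, h0=None):
    """Run h_t = tanh(W @ x_t + U @ h_{t-1} + b) over a sequence.

    xs has shape (T, input_dim); the result has shape (T, hidden_dim).
    """
    hidden_dim = U.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    states = []
    for x_t in xs:  # sequential by construction: step t needs h from step t-1
        h = np.tanh(W @ x_t + U @ h + b)
        states.append(h)
    return np.stack(states)

# Illustrative sizes and random weights (assumptions, not from the cheatsheet).
rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4
xs = rng.normal(size=(T, input_dim))
W = rng.normal(size=(hidden_dim, input_dim)) * 0.5
U = rng.normal(size=(hidden_dim, hidden_dim)) * 0.5
b = np.zeros(hidden_dim)

hs = rnn_forward(xs, W, U, b)
print(hs.shape)  # (5, 4): one hidden state per time step
```

The weights `W`, `U`, `b` are reused at every step; only `h` changes as the loop walks the sequence.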

Attention

Element	Detail
Formula	`Attention(Q, K, V) = softmax( Q · K^T / sqrt(d_k) ) · V`
Q, K, V	Queries, keys, values; matrices obtained from the input via learned linear projections
Score	`Q · K^T` (dot products, query vs each key)
Scale	Divide by `sqrt(d_k)` so scores do not grow with the key dimension `d_k` and saturate the softmax
Weights	Softmax of scaled scores → probability distribution over positions
Output	Weighted average of `V` using those probabilities
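
The rows of the table map one-to-one onto a short NumPy function. The sketch below assumes single-head attention with no masking and takes `Q`, `K`, `V` as given (the learned projections that produce them are left out); the names `softmax` and `attention` are local helpers, not a library API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    return weights @ V, weights         # weighted average of the value rows

# Illustrative shapes (assumptions): 2 queries, 3 key/value positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))

out, weights = attention(Q, K, V)
print(out.shape, weights.shape)  # (2, 5) (2, 3)
```

All queries are handled by the same two matrix products, which is where the parallelism comes from.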

Two lifts attention gives over RNN

Lift	Why
Parallelism	All positions computed at once; no sequential dependency between steps
Long-range	Every output position attends directly to every input position; no distance penalty

Both come from the same change: “step through with hidden state” → “weighted average over all, weighted by compatibility.”
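
One way to see the "no distance penalty" point: plain attention does not know where a key sits in the sequence, so shuffling the key/value pairs leaves the output unchanged. The sketch below is an illustration under assumed sizes and random data; `attend` is a hypothetical single-query helper.

```python
import numpy as np

def attend(q, K, V):
    """Attention output for a single query vector q."""
    scores = K @ q / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

rng = np.random.default_rng(0)
n, d = 50, 8                  # 50 positions, dimension 8 (illustrative)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
q = rng.normal(size=d)

perm = rng.permutation(n)     # move every position somewhere else
out_original = attend(q, K, V)
out_shuffled = attend(q, K[perm], V[perm])

print(np.allclose(out_original, out_shuffled))  # True
```

The flip side is that order information has to be supplied separately, which is why transformer models add position embeddings to their inputs.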

Worked attention example (body, d_k = 2)

Step	Computation	Result
Inputs	`q = [1, 0]`; K rows `[1,0], [0,1], [1,1]`; V rows `[1,0], [0,1], [1,1]`	(given)
Scores	`q · k_i`	`[1, 0, 1]`
Scaled	`÷ sqrt(2)`	`[0.707, 0, 0.707]`
Softmax	over scaled	`[0.401, 0.198, 0.401]`
Output	weighted sum of V	`[0.802, 0.599]`
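
The table can be checked numerically. The snippet below follows the same four steps with the same inputs; only the variable names are new.

```python
import numpy as np

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d_k = 2

scores = K @ q                                    # q · k_i for each key row
scaled = scores / np.sqrt(d_k)                    # divide by sqrt(d_k)
weights = np.exp(scaled) / np.exp(scaled).sum()   # softmax over positions
output = weights @ V                              # weighted sum of value rows

print(np.round(scores, 3))   # approx. [1, 0, 1]
print(np.round(scaled, 3))   # approx. [0.707, 0, 0.707]
print(np.round(weights, 3))  # approx. [0.401, 0.198, 0.401]
print(np.round(output, 3))   # approx. [0.802, 0.599]
```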

Vision applications

Architecture	Encoder	Sequence model	Where
Classic CNN-RNN captioning	CNN	RNN decoder	Image in, caption out
Modern CNN-attention captioning	CNN feature maps	Attention decoder over regions	Image in, caption out (with attention maps over image)
CNN-RNN video understanding	Per-frame CNN	RNN over frame features	Video in, action/caption/etc out
Vision Transformer (ViT)	None (patches direct)	Transformer encoder over patches	Image-as-sequence; replaces convolution

Vision Transformer (ViT) recipe

Step	What
1	Cut image into a grid of small patches (commonly 16×16 pixels)
2	Flatten + linearly embed each patch as a vector
3	Add learned position embeddings so model knows where each patch sat
4	Prepend a learnable “class token” to the patch sequence (it gets its own position embedding)
5	Run a standard transformer encoder over the patch sequence
6	Read off the class-token representation; a linear classifier turns it into one score per class
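
The data flow of the recipe can be traced with array shapes alone. The sketch below is a shape walkthrough under loud assumptions: all sizes are made up, the "learned" parameters are random or zero arrays, and the transformer encoder in step 5 is replaced by an identity stand-in. The class token is prepended before position embeddings are added so that it receives one too, as in the common ViT formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
height = width = 32   # image size in pixels
channels = 3
patch = 16            # patch side length
dim = 8               # embedding dimension
num_classes = 10

image = rng.normal(size=(height, width, channels))

# Step 1: cut the image into a grid of patch x patch tiles.
gh, gw = height // patch, width // patch
tiles = image.reshape(gh, patch, gw, patch, channels).transpose(0, 2, 1, 3, 4)

# Step 2: flatten each tile and embed it linearly.
flat = tiles.reshape(gh * gw, patch * patch * channels)    # (N, patch*patch*C)
W_embed = rng.normal(size=(flat.shape[1], dim)) * 0.02     # learned in a real model
tokens = flat @ W_embed                                    # (N, dim)

# Step 4: prepend the class token (learned in a real model; zeros here).
cls_token = np.zeros((1, dim))
tokens = np.concatenate([cls_token, tokens], axis=0)       # (N + 1, dim)

# Step 3: add position embeddings, one per sequence slot.
pos_embed = rng.normal(size=(gh * gw + 1, dim)) * 0.02     # learned in a real model
tokens = tokens + pos_embed

# Step 5: a real model runs a transformer encoder here; identity is a stand-in.
encoded = tokens

# Step 6: read the class-token slot and apply a linear classifier.
W_head = rng.normal(size=(dim, num_classes)) * 0.02
logits = encoded[0] @ W_head

print(tokens.shape, logits.shape)  # (5, 8) (10,)
```

With a 32×32 image and 16×16 patches the sequence has 4 patch tokens plus 1 class token, so the encoder would see only 5 positions.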

Sister-track routing (mechanics)

Topic	Depth
RNN, LSTM, GRU mechanics	Track 12 lesson 2 (Why sequences need memory)
Attention + transformer blocks (multi-head, full encoder/decoder)	Track 5 multi-lesson sequence; Track 14 (practical transformers)
Linear + non-linearity + backprop machinery underneath both	Track 11 lessons 3-4 (Neural Network Intuition)
The training loop on top	Track 16 lesson 3 (loss) and lesson 4 (backprop); unchanged for any architecture

Pitfalls

Pitfall	Reality
RNN’s hidden state = its weights	No; the state changes step by step within a sequence, while the weights stay fixed during a forward pass and change only when training updates them
Attention is a separate idea from neural networks	No; it’s a layer built from the same multiply-add-softmax machinery; its “weights” (attention probabilities) are computed from the input
ViT = CNN + attention	No; ViT REPLACES convolution. Pure ViT has no conv
Transformer always beats CNN	Trade-off; transformers typically need more data and compute before they outperform CNNs, but they scale further

One-line takeaway

Sequences enter vision in three places (output, input, internal); recurrence handles them one step at a time, attention handles them in parallel; ViT’s bet that attention alone can replace convolution paid off at scale; sister tracks own the deep mechanics.