Cheatsheet: Sequence tools for vision

| Where | Example task | Tool |
| --- | --- | --- |
| Output | Image captioning (sequence of words) | RNN or attention decoder |
| Input | Video understanding (sequence of frames) | RNN, temporal CNN, or transformer over time |
| Internal | Vision Transformer (image as sequence of patches) | Transformer encoder |

| Element | Detail |
| --- | --- |
| Update rule | h_t = f(W · x_t + U · h_{t-1} + b), where f is a non-linearity such as tanh |
| Hidden state | Network’s running memory, carried forward step by step |
| Output per step | Read off h_t, optionally via another linear layer |
| Variants | LSTM, GRU (add gates to control what’s remembered/forgotten) |
| Limits | Sequential (step t needs h_{t-1}, so no parallelism across steps); weak at long-range dependencies |

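The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes f = tanh, and the weights `W`, `U`, `b` and the inputs `xs` are made-up values chosen only to show the loop.

```python
import math

def rnn_step(x_t, h_prev, W, U, b):
    """One recurrent update: h_t = tanh(W x_t + U h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(U[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t

def run_rnn(xs, h0, W, U, b):
    """Step through the sequence; each step needs the previous hidden state."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, U, b)  # the weights stay fixed, only h changes
        states.append(h)
    return states

# Hypothetical weights and inputs, for illustration only.
W = [[0.5, -0.3], [0.1, 0.8]]
U = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.1]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = run_rnn(xs, [0.0, 0.0], W, U, b)
for t, h in enumerate(states, start=1):
    print(t, [round(v, 3) for v in h])
```

The `for` loop in `run_rnn` is the sequential limit from the table: step t cannot start until step t-1 has produced its hidden state.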

| Element | Detail |
| --- | --- |
| Formula | Attention(Q, K, V) = softmax( Q · K^T / sqrt(d_k) ) · V |
| Q, K, V | Queries, keys, values; matrices derived from input |
| Score | Q · K^T (dot products, query vs each key) |
| Scale | Divide by sqrt(d_k) to keep scores stable at large key dim |
| Weights | Softmax of scaled scores → probability distribution over positions |
| Output | Weighted average of V using those probabilities |

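As a rough sketch of the formula above, the function below computes scaled dot-product attention over lists of row vectors using only the standard library. The helper names `softmax` and `attention` and the sample `Q`, `K`, `V` are assumptions made for illustration; real models derive Q, K, V from the input through learned projections, which are omitted here.

```python
import math

def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:  # each query row is independent of the others
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Hypothetical toy inputs: two queries attending over two key/value rows.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
print(out)
```

No iteration of the loop over `Q` depends on another, which is why practical implementations replace it with a single matrix multiplication.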

| Lift | Why |
| --- | --- |
| Parallelism | All positions computed at once; no sequential dependency between steps |
| Long-range | Every output position attends directly to every input position; no distance penalty |

Both come from the same change: “step through with hidden state” → “weighted average over all, weighted by compatibility.”

| Step | Computation | Result |
| --- | --- | --- |
| Inputs | q = [1, 0]; K rows [1,0], [0,1], [1,1]; V rows [1,0], [0,1], [1,1] | |
| Scores | q · k_i | [1, 0, 1] |
| Scaled | ÷ sqrt(2), since d_k = 2 | [0.707, 0, 0.707] |
| Softmax | over scaled | [0.401, 0.198, 0.401] |
| Output | weighted sum of V | [0.802, 0.599] |

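The worked example can be checked step by step with a short script. It uses the same q, K, and V as the table; the variable names are illustrative only.

```python
import math

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

d_k = len(q)  # key dimension, 2 here

# Scores: dot product of the query with each key row.
scores = [sum(a * b for a, b in zip(q, k)) for k in K]

# Scale by sqrt(d_k).
scaled = [s / math.sqrt(d_k) for s in scores]

# Softmax over the scaled scores.
exps = [math.exp(s) for s in scaled]
weights = [e / sum(exps) for e in exps]

# Output: weighted sum of the value rows.
output = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

print(scores)                            # [1.0, 0.0, 1.0]
print([round(x, 3) for x in scaled])     # [0.707, 0.0, 0.707]
print([round(x, 3) for x in weights])    # [0.401, 0.198, 0.401]
print([round(x, 3) for x in output])     # [0.802, 0.599]
```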

| Architecture | Encoder | Sequence model | Where |
| --- | --- | --- | --- |
| Classic CNN-RNN captioning | CNN | RNN decoder | Image in, caption out |
| Modern CNN-attention captioning | CNN feature maps | Attention decoder over regions | Image in, caption out (with attention maps over image) |
| CNN-RNN video understanding | Per-frame CNN | RNN over frame features | Video in, action, caption, etc. out |
| Vision Transformer (ViT) | None (patches direct) | Transformer encoder over patches | Image-as-sequence; replaces convolution |

| Step | What |
| --- | --- |
| 1 | Cut the image into a grid of small patches (commonly 16×16 pixels) |
| 2 | Flatten + linearly embed each patch as a vector |
| 3 | Add learned position embeddings so the model knows where each patch sat |
| 4 | Prepend a learnable “class token” to the sequence |
| 5 | Run a standard transformer encoder over the patch sequence |
| 6 | Read off the class-token representation; a linear classifier produces K scores, one per class |

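Steps 1 to 4 can be sketched on a toy image in plain Python. Everything concrete here is an assumption made for illustration: a 4×4 single-channel image, 2×2 patches, an embedding width of 3, and random values standing in for the learned embedding matrix, position embeddings, and class token. The sketch follows the table's step order; implementations commonly give the class token a position embedding of its own as well.

```python
import random

def patchify(image, patch):
    """Step 1 and the flatten half of step 2: cut an H x W image into
    patch x patch tiles (assumes H and W are multiples of patch)."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(tile)
    return patches

def linear(vec, weight):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, vec)) for row in weight]

random.seed(0)
H, W, P, D = 4, 4, 2, 3  # toy sizes, far smaller than a real ViT
image = [[float(r * W + c) for c in range(W)] for r in range(H)]

patches = patchify(image, P)  # 4 patches, each a vector of length P*P

# Random stand-ins for parameters that would be learned in training.
embed_w = [[random.gauss(0, 0.02) for _ in range(P * P)] for _ in range(D)]
pos = [[random.gauss(0, 0.02) for _ in range(D)] for _ in patches]
cls_token = [random.gauss(0, 0.02) for _ in range(D)]

tokens = [linear(p, embed_w) for p in patches]            # step 2: embed
tokens = [[t + e for t, e in zip(tok, pe)]
          for tok, pe in zip(tokens, pos)]                # step 3: add position
sequence = [cls_token] + tokens                           # step 4: prepend class token

print(len(sequence), len(sequence[0]))
```

The resulting `sequence` is what a transformer encoder would consume in step 5; the encoder and classifier are left out to keep the sketch short.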

| Topic | Depth |
| --- | --- |
| RNN, LSTM, GRU mechanics | Track 12 lesson 2 (Why sequences need memory) |
| Attention + transformer blocks (multi-head, full encoder/decoder) | Track 5 multi-lesson sequence; Track 14 (practical transformers) |
| Linear + non-linearity + backprop machinery underneath both | Track 11 lessons 3-4 (Neural Network Intuition) |
| The training loop on top | Track 16 lesson 3 (loss) + lesson 4 (backprop), unchanged for any architecture |

| Pitfall | Reality |
| --- | --- |
| RNN’s hidden state = its weights | No; the state changes step by step within a sequence, while the weights stay fixed until a training update changes them |
| Attention is a separate idea from neural networks | It’s a layer using the same multiply-add-softmax machinery; the “weights” (attention probs) depend on the input |
| ViT = CNN + attention | No; ViT replaces convolution. A pure ViT has no convolutional layers |
| Transformer always beats CNN | Trade-off; transformers need more data + compute to outperform but scale further |

Sequences enter vision in three places: output, input, and internal. Recurrence handles them one step at a time; attention handles them in parallel. ViT’s bet that attention alone can replace convolution paid off at scale. Sister tracks own the deep mechanics.