| Where | Example task | Tool |
|---|---|---|
| Output | Image captioning (sequence of words) | RNN or attention decoder |
| Input | Video understanding (sequence of frames) | RNN, temporal CNN, or transformer over time |
| Internal | Vision Transformer (image as sequence of patches) | Transformer encoder |

| Element | Detail |
|---|---|
| Update rule | h_t = f(W · x_t + U · h_{t-1} + b) |
| Hidden state | Network’s running memory, carried forward step by step |
| Output per step | Read off h_t, optionally via another linear layer |
| Variants | LSTM, GRU (add gates to control what’s remembered/forgotten) |
| Limits | Sequential (step t needs h_{t-1}, so no parallelism across time steps); weak at long-range dependencies |
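
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `tanh` is assumed for the non-linearity `f`, and the weights, sizes, and helper names (`rnn_step`, `run_rnn`) are made up for the example.

```python
import math

def rnn_step(x_t, h_prev, W, U, b):
    """One recurrence step: h_t = tanh(W . x_t + U . h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(U[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(xs, h0, W, U, b):
    """Step through the sequence; the weights stay fixed, the state changes."""
    h = h0
    states = []
    for x_t in xs:
        # Each step needs the previous h, so the loop cannot run in parallel.
        h = rnn_step(x_t, h, W, U, b)
        states.append(h)
    return states

# Toy values (assumed): 2-dim inputs, 2-dim hidden state.
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = run_rnn(xs, [0.0, 0.0], W, U, b)
for t, h in enumerate(states):
    print(t, h)
```

The loop makes the "Limits" row concrete: `run_rnn` must finish step t-1 before it can start step t.
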

| Element | Detail |
|---|---|
| Formula | Attention(Q, K, V) = softmax( Q · K^T / sqrt(d_k) ) · V |
| Q, K, V | Queries, keys, values; matrices derived from input |
| Score | Q · K^T (dot products, query vs each key) |
| Scale | Divide by sqrt(d_k) to keep scores stable at large key dim |
| Weights | Softmax of scaled scores → probability distribution over positions |
| Output | Weighted average of V using those probabilities |
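
As a rough sketch, the formula in the table can be written in plain Python with lists standing in for matrices. The function names and the toy `Q`, `K`, `V` values are assumptions for illustration; a real implementation would use batched matrix operations in a numerical library.

```python
import math

def softmax(xs):
    """Turn scores into a probability distribution (shifted by max for stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q . K^T / sqrt(d_k)) . V, row by row."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Score: dot product of this query with every key, then scale.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Weights: probability distribution over positions.
        weights = softmax(scores)
        # Output: weighted average of the value rows.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs

# Toy values (assumed): two queries, two key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

out = attention(Q, K, V)
print(out)
```

Each query's output row does not depend on any other query's output, which is why all positions can be computed at once in a matrix formulation.
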

| Lift | Why |
|---|---|
| Parallelism | All positions computed at once; no sequential dependency between steps |
| Long-range | Every output position attends directly to every input position; no distance penalty |

Both come from the same change: replacing “step through with a hidden state” with “a weighted average over all positions, weighted by compatibility.”

| Step | Computation | Result |
|---|---|---|
| Inputs | q = [1, 0], K rows [1,0], [0,1], [1,1], V rows [1,0], [0,1], [1,1] | |
| Scores | q · k_i | [1, 0, 1] |
| Scaled | ÷ sqrt(2) | [0.707, 0, 0.707] |
| Softmax | over scaled | [0.401, 0.198, 0.401] |
| Output | weighted sum of V | [0.802, 0.599] |
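
The numbers in this worked example can be reproduced with a short, standard-library-only script. Variable names are chosen for the illustration; the inputs are exactly the ones in the table.

```python
import math

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Scores: dot product of q with each key row.
scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]

# Scaled: divide by sqrt(d_k), where d_k = 2 is the key dimension.
scaled = [s / math.sqrt(len(q)) for s in scores]

# Softmax over the scaled scores.
exps = [math.exp(s) for s in scaled]
weights = [e / sum(exps) for e in exps]

# Output: weighted sum of the value rows.
output = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

print(scores)                          # [1.0, 0.0, 1.0]
print([round(s, 3) for s in scaled])   # [0.707, 0.0, 0.707]
print([round(w, 3) for w in weights])  # [0.401, 0.198, 0.401]
print([round(o, 3) for o in output])   # [0.802, 0.599]
```
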

| Architecture | Encoder | Sequence model | Use case |
|---|---|---|---|
| Classic CNN-RNN captioning | CNN | RNN decoder | Image in, caption out |
| Modern CNN-attention captioning | CNN feature maps | Attention decoder over regions | Image in, caption out (with attention maps over image) |
| CNN-RNN video understanding | Per-frame CNN | RNN over frame features | Video in, action/caption/etc out |
| Vision Transformer (ViT) | None (patches direct) | Transformer encoder over patches | Image-as-sequence; replaces convolution |

| Step | What |
|---|---|
| 1 | Cut image into a grid of small patches (commonly 16×16 pixels) |
| 2 | Flatten + linearly embed each patch as a vector |
| 3 | Add learned position embeddings so model knows where each patch sat |
| 4 | Prepend a learnable “class token” to the sequence |
| 5 | Run a standard transformer encoder over the patch sequence |
| 6 | Read off the class-token representation; a linear classifier turns it into K scores, one per class |
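
Steps 1 to 4 can be sketched in plain Python to show how an image becomes a token sequence. Everything here is an assumption for illustration: a tiny 8×8 single-channel image, 4×4 patches, embedding width 3, randomly initialised (untrained) weights, and the helper names `patchify` and `linear`. The transformer encoder and classifier (steps 5 and 6) are left out.

```python
import random

def patchify(image, p):
    """Step 1 (plus flattening): cut an H x W image into flattened p x p patches."""
    H, W = len(image), len(image[0])
    patches = []
    for top in range(0, H, p):
        for left in range(0, W, p):
            patches.append(
                [image[top + i][left + j] for i in range(p) for j in range(p)]
            )
    return patches

def linear(x, weight):
    """Project vector x with a (len(x) x d) weight matrix."""
    d = len(weight[0])
    return [sum(x[i] * weight[i][k] for i in range(len(x))) for k in range(d)]

random.seed(0)
H, W, p, d = 8, 8, 4, 3  # toy sizes (assumed)
image = [[float(r * W + c) for c in range(W)] for r in range(H)]

# Step 1: 2 x 2 grid of patches, each flattened to p * p = 16 values.
patches = patchify(image, p)

# Step 2: linearly embed each flattened patch as a d-dim vector.
E = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(p * p)]
tokens = [linear(patch, E) for patch in patches]

# Step 4: prepend the class token (a learnable vector; zeros here).
cls_token = [0.0] * d
seq = [cls_token] + tokens

# Step 3: add one position embedding per sequence slot. It is applied after
# prepending so the class token gets a position too, as in the original ViT.
pos = [[random.gauss(0, 0.02) for _ in range(d)] for _ in seq]
seq = [[t + e for t, e in zip(tok, pe)] for tok, pe in zip(seq, pos)]

print(len(patches), len(patches[0]))  # 4 16
print(len(seq), len(seq[0]))          # 5 3
```

The resulting `seq` (one class token plus four patch tokens) is what a standard transformer encoder would consume in step 5.
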

| Topic | Covered in depth in |
|---|---|
| RNN, LSTM, GRU mechanics | Track 12 lesson 2 (Why sequences need memory) |
| Attention + transformer blocks (multi-head, full encoder/decoder) | Track 5 multi-lesson sequence; Track 14 (practical transformers) |
| Linear + non-linearity + backprop machinery underneath both | Track 11 lessons 3-4 (Neural Network Intuition) |
| The training loop on top | Track 16 lesson 3 (loss) and lesson 4 (backprop); unchanged for any architecture |

| Pitfall | Reality |
|---|---|
| RNN’s hidden state = its weights | No; the state changes step by step within a sequence, while the weights stay fixed until a training update changes them |
| Attention is a separate idea from NN | It’s a layer using the same multiply-add-softmax machinery; the “weights” (attention probs) depend on the input |
| ViT = CNN + attention | No; ViT REPLACES convolution. Pure ViT has no conv |
| Transformer always beats CNN | Trade-off; transformers need more data + compute to outperform but scale further |

Sequences enter vision in three places (output, input, internal). Recurrence handles them one step at a time; attention handles them in parallel. ViT’s bet that attention alone can replace convolution paid off at scale. Sister tracks own the deep mechanics.