References: Sequence tools for vision
Source material
This lesson combines Stanford CS231n’s treatments of recurrence and attention/transformers as they apply to vision (per the Track 16 Phase 0 arc); the deep architectural mechanics live in sister tracks.
- Course: Stanford CS231n, “Deep Learning for Computer Vision”
- Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
- Course site: cs231n.stanford.edu
- This lesson maps to: Lecture 7 (Recurrent Neural Networks) + Lecture 8 (Attention and Transformers), combined per the Track 16 Phase 0 arc with deep mechanics deferred to sister tracks.
Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.
A note on access and license
The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.
Primary papers (cited by name and venue)
- Image captioning (CNN-RNN classic). Vinyals et al., “Show and Tell: A Neural Image Caption Generator” (CVPR 2015) and Karpathy & Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions” (CVPR 2015).
- Image captioning with attention. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” (ICML 2015), the paper that introduced attention over image regions for captioning.
- Attention is all you need. Vaswani et al., “Attention Is All You Need” (NeurIPS 2017), the transformer paper.
- Vision Transformer (ViT). Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (ICLR 2021), the paper that established that attention-only architectures could match or beat CNNs on ImageNet at scale.
Sister-track routing (Clawdemy)
This lesson stays at the applied-to-vision level by Phase 0 design. For deeper mechanics, the right destinations are:
- Recurrence (RNN, LSTM, GRU mechanics). Track 12 lesson 2 (Why sequences need memory) covers the recurrence loop in a generic NN setting.
- Attention and transformer blocks (Q/K/V, multi-head attention, full transformer encoder/decoder, positional encoding). Track 5 (AI Foundations) has a multi-lesson sequence covering attention, multi-head attention, and the full transformer block in depth. Track 14 (Practical Transformers, upcoming) covers transformer-style architectures end to end with code.
- The linear + non-linearity + backprop machinery underneath all of it. Track 11 (Neural Network Intuition) lessons 3-4 establish the per-neuron computation and the backprop loop both tools run on top of.
Further study
- Karpathy’s “The Unreasonable Effectiveness of Recurrent Neural Networks” (blog post, 2015). The classic, accessible introduction to character-level RNNs.
- Olah’s “Understanding LSTM Networks” (blog post, 2015). The widely-cited gentle introduction to LSTM cells with the canonical diagrams.
- “The Annotated Transformer” (Harvard NLP, 2018). A code-annotated walk through Vaswani et al.’s transformer paper; an excellent companion when you read the primary paper.
How we use this source
Clawdemy follows CS231n’s pedagogical ordering (recurrence first, attention second, then the transformer applied to vision via ViT). The basic RNN formula h_t = f(W · x_t + U · h_{t-1} + b) and the standard attention formula Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V are canonical and appear in essentially all standard references on these topics; both are presented here in their textbook forms. The worked attention example in the body (q = [1, 0], three keys, three values, d_k = 2 → output [0.802, 0.599]) and the fresh example in practice (q = [0, 1], different keys/values → output [0.744, 1.254]) are Clawdemy-authored. We do not reproduce CS231n’s slides, figures, or problem sets. Full attribution policy: see Doc/attribution-policy.md.
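For readers who want to see the two textbook formulas as code, here is a minimal sketch in plain Python. It is an illustration, not the lesson's implementation: the function names (softmax, attention, rnn_step) are hypothetical, f is assumed to be tanh (the formula above leaves f unspecified), and the query, keys, values, and weights in the usage lines are made-up numbers, not the ones behind the lesson's worked examples.

```python
import math


def softmax(scores):
    """Softmax over a list of floats, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, keys, values):
    """Scaled dot-product attention for one query: softmax(q . K^T / sqrt(d_k)) . V."""
    d_k = len(q)
    # One score per key: dot product with the query, scaled by sqrt(d_k).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    # Output is the weighted sum of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output


def rnn_step(x_t, h_prev, W, U, b, f=math.tanh):
    """One recurrence step: h_t = f(W . x_t + U . h_{t-1} + b). f=tanh is an assumption."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(U[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(f(pre))
    return h_t


# Hypothetical inputs, chosen only to exercise the functions.
weights, output = attention(
    q=[1.0, 1.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
)
print("attention weights:", weights)
print("attention output:", output)

h = rnn_step(x_t=[1.0], h_prev=[0.0, 0.0], W=[[0.5], [-0.5]],
             U=[[0.1, 0.0], [0.0, 0.1]], b=[0.0, 0.0])
print("next hidden state:", h)
```

In this made-up case the third key aligns best with the query, so it receives the largest weight and pulls the output toward the third value vector. The same attention function handles any single query with d_k equal to the query's length; batching queries into a matrix Q is the usual next step and is left to the sister tracks.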