References: Convolution and CNNs

This lesson follows Stanford CS231n’s treatment of the convolutional layer, the workhorse building block of computer-vision architectures.

  • Course: Stanford CS231n, “Deep Learning for Computer Vision”
  • Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
  • Course site: cs231n.stanford.edu
  • Course notes (convolutional networks): cs231n.github.io/convolutional-networks (the canonical write-up of the conv operation, filter shapes, the three hyperparameters K/S/P, the output spatial-size formula, weight sharing, and the AlexNet first-layer parameter-count example).
  • This lesson maps to: Lecture 5 (Image Classification with CNNs).

Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.

The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.

  • CS231n convolutional-networks notes. cs231n.github.io/convolutional-networks goes substantially deeper: pooling layers, dilated convolutions, full-volume diagrams of how feature maps stack through a network, and several worked examples beyond the ones cited here.
  • AlexNet paper. Krizhevsky, Sutskever, and Hinton, “ImageNet Classification with Deep Convolutional Neural Networks” (NeurIPS 2012). The 11x11x3 first-layer example used in this lesson comes from AlexNet’s first conv layer.
  • Introduction to Deep Learning (Track 12, Clawdemy), lessons 4 and 5. “How machines see: convolution” and “From edges to objects” cover the same operation at survey-level intuition; Track 16 readers who want a gentler first pass before this lesson’s more formal treatment will find it there.
  • Neural Network Intuition (Track 11, Clawdemy), lessons 1-2. The handwritten-digit rasterization used in Track 11’s diagrams is the same pixel-grid-as-numbers picture that motivates the local-patch structure of convolution.

Clawdemy follows CS231n’s pedagogical ordering: motivate convolution against fully connected (FC) layers for images, define convolution, name the three hyperparameters, give the output formula, then cover weight sharing and parameter counts. The verbatim CS231n numbers carried in this lesson are:

  • the 200x200x3 FC neuron at 120,000 weights;
  • the 7x7 input / 3x3 filter / stride 1 / pad 0 -> 5x5 output example (and the stride 2 -> 3x3 variant);
  • the AlexNet first layer at 96 filters of 11x11x3 (34,848 weights) + 96 biases = 34,944 parameters;
  • the parameter-sharing rationale quote (“if one feature is useful at (x,y), it should be useful at (x’,y’)”).

The 5x5 vertical-edge worked example in the lesson body, together with the 5x5 horizontal-edge example, the output-size practice, and the parameter-comparison exercise in the practice section, are Clawdemy-authored against the CS231n framing. We do not reproduce CS231n’s slides, figures, or problem sets. Full attribution policy: see Doc/attribution-policy.md.
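
For readers who want to re-derive the figures listed above, here is a minimal Python sketch. It assumes the CS231n output-size formula (W - F + 2P)/S + 1 for input width W, filter size F, stride S, and padding P. The helper names conv_output_size and conv_param_count are hypothetical, chosen for illustration; they come from neither CS231n nor the lesson body.

```python
def conv_output_size(w: int, f: int, s: int = 1, p: int = 0) -> int:
    """Spatial output size along one dimension: (W - F + 2P) / S + 1."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        # The filter does not fit evenly across the input with these settings.
        raise ValueError(f"W={w}, F={f}, S={s}, P={p} do not fit evenly")
    return span // s + 1


def conv_param_count(num_filters: int, f: int, in_depth: int) -> int:
    """Weights plus one bias per filter, with weights shared across positions."""
    weights = num_filters * f * f * in_depth
    biases = num_filters
    return weights + biases


# One fully connected neuron looking at a 200x200x3 image.
fc_neuron_weights = 200 * 200 * 3

print(fc_neuron_weights)                   # 120000
print(conv_output_size(7, 3, s=1, p=0))    # 5
print(conv_output_size(7, 3, s=2, p=0))    # 3
print(conv_param_count(96, 11, 3))         # 34944 (34848 weights + 96 biases)
```

Under the same assumptions, a stride of 3 on the 7x7 input raises ValueError, because (7 - 3) is not divisible by 3 and the filter cannot be placed evenly across the input.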