How machines see: convolution

This lesson opens Track 12’s second problem shape, images, and introduces the operation that lets a network see: the convolution. The source curriculum is MIT 6.S191, Lecture 3, by Alexander and Ava Amini, freely available at introtodeeplearning.com.

You will see why the fully-connected network from earlier is the wrong tool for pictures and meet the convolution, a small filter that slides across the image looking for a local pattern. You will work an edge detector by hand to feel how a filter “lights up” on its pattern, and understand why sharing one small filter across every position is what makes vision networks both efficient and translation-invariant: the same few weights are reused everywhere, so the filter responds to its pattern wherever it appears. This sets up the next lesson, where stacked convolutions build edges up into whole objects.
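As a rough preview of the sliding idea, here is a minimal sketch in plain Python. The 3x3 vertical-edge filter, the tiny 4x8 image, and the helper names `slide` and `make_image` are assumptions chosen for illustration, not the lesson's own example. Like most deep-learning libraries, the sketch applies the filter without flipping it.

```python
# Hypothetical 3x3 vertical-edge filter: negative weights on the left column,
# positive on the right, so it responds where dark pixels sit left of bright ones.
EDGE_FILTER = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def slide(image, kernel):
    """Slide `kernel` over every position where it fully fits.

    Returns a grid of local weighted sums, one per position.
    """
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def make_image(edge_col, rows=4, cols=8):
    """Build a toy image: dark (0) left of `edge_col`, bright (1) from it onward."""
    return [[1 if c >= edge_col else 0 for c in range(cols)] for _ in range(rows)]


early = slide(make_image(edge_col=3), EDGE_FILTER)
late = slide(make_image(edge_col=5), EDGE_FILTER)

print(early[0])  # [0, 3, 3, 0, 0, 0]
print(late[0])   # [0, 0, 0, 3, 3, 0]
```

The same nine weights are used at every position. The response is zero over flat regions and high where the filter straddles the edge, and when the edge moves two columns to the right the response simply moves with it.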

This is lesson 4 of 10, opening Phase 2 (Vision and generation). It leaves the sequence phase behind and starts fresh on images, so it builds on the neural-network basics and the lesson-1 framing (the fully-connected digit network) rather than on the sequence lessons. The next lesson, From edges to objects, stacks the single convolution built here into a full recognition hierarchy.

Prerequisites: lesson 1 of this track and the neural-network basics from the previous track (especially what a fully-connected layer is, since the lesson opens by contrasting against it). The sequence lessons (2 and 3) are not required for this one.

The math is light and concrete. The only arithmetic is a convolution worked by hand: multiply each pixel in a small patch by the matching weight in a small grid, then add the products up. No calculus, no formulas beyond multiply-and-add. The practice section has you run the same filter on fresh patches.
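The multiply-and-add step can be sketched in a few lines of Python. The filter weights and the two patches below are made-up values for illustration, and `respond` is a hypothetical helper name rather than anything from the source lecture.

```python
def respond(patch, kernel):
    """Multiply each pixel by the weight at the same position, then add everything up."""
    return sum(
        pixel * weight
        for patch_row, kernel_row in zip(patch, kernel)
        for pixel, weight in zip(patch_row, kernel_row)
    )


# Assumed example filter: looks for dark-on-the-left, bright-on-the-right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# A patch that contains the pattern: dim pixels (1) then a bright column (9).
edge_patch = [
    [1, 1, 9],
    [1, 1, 9],
    [1, 1, 9],
]

# A patch with no pattern: every pixel has the same brightness.
flat_patch = [
    [5, 5, 5],
    [5, 5, 5],
    [5, 5, 5],
]

print(respond(edge_patch, kernel))  # 24: each row gives -1 + 0 + 9 = 8, times 3 rows
print(respond(flat_patch, kernel))  # 0: each row gives -5 + 0 + 5 = 0
```

The filter "lights up" (24) on the patch that matches its pattern and stays at zero on the uniform patch, which is the high-versus-near-zero reading the lesson asks you to practice.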

  • Explain why a fully-connected layer is the wrong tool for images (parameters, locality, no reuse)
  • Describe a convolution as a small filter that slides across an image computing a local weighted sum
  • Compute a convolution by hand and read its response (high where the pattern is present, near zero where absent)
  • Explain weight-sharing and its two payoffs (far fewer parameters, translation invariance)
  • Read time: about 9 minutes
  • Practice time: about 10 minutes (a by-hand convolution on fresh patches plus flashcards)
  • Difficulty: standard (one small, concrete by-hand calculation; otherwise conceptual)
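
The parameter savings named in the weight-sharing objective above can be made concrete with a rough count. The sizes here are assumptions for illustration only: a 28x28 grayscale digit image, a fully-connected layer with 100 hidden units, and a single 3x3 filter with one bias.

```python
# Assumed sizes, chosen for illustration (not taken from the lesson).
height, width = 28, 28      # grayscale digit image
hidden_units = 100          # fully-connected layer width
filter_size = 3             # one 3x3 convolution filter

# Fully connected: every pixel gets its own weight to every hidden unit,
# plus one bias per hidden unit.
fc_params = height * width * hidden_units + hidden_units

# Convolution: one small grid of weights shared across all positions,
# plus a single bias.
conv_params = filter_size * filter_size + 1

print(fc_params)    # 78500
print(conv_params)  # 10
```

A real vision layer uses many filters rather than one, so the gap in practice is smaller than this, but the count still does not grow with the image size the way the fully-connected count does.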