Convolution: brief

What you’ll learn

This lesson opens Track 12’s second problem shape, images, and introduces the operation that lets a network see: the convolution. The source curriculum is MIT 6.S191, Lecture 3, by Alexander Amini and Ava Amini, freely available at introtodeeplearning.com.

You will see why the fully-connected network from earlier is the wrong tool for pictures, then meet the convolution: a small filter that slides across the image looking for a local pattern. You will work an edge detector by hand to feel how a filter “lights up” on its pattern, and understand why sharing one small filter across every position is what makes vision networks both efficient and translation-invariant (the same filter detects its pattern wherever it appears in the image). This sets up the next lesson, where stacked convolutions build edges up into whole objects.

Where this fits

This is lesson 4 of 10, opening Phase 2 (Vision and generation). It leaves the sequence phase behind and starts fresh on images, so it builds on the neural-network basics and the lesson-1 framing (the fully-connected digit network) rather than on the sequence lessons. The next lesson, From edges to objects, stacks the single convolution built here into a full recognition hierarchy.

Before you start

Prerequisites: lesson 1 of this track and the neural-network basics from the previous track (especially what a fully-connected layer is, since the lesson opens by contrasting against it). The sequence lessons (2 and 3) are not required for this one.

About the math

Light and concrete. The only arithmetic is a convolution worked by hand: multiply each pixel in a small patch by the matching weight in a small grid, then add up the products. No calculus, no formulas beyond multiply-and-add. The practice section has you run the same filter on fresh patches.
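For a feel of how small that arithmetic is, here is a minimal sketch in plain Python. The 3x3 vertical-edge filter and both patches are illustrative assumptions, not values taken from the lesson, and `patch_response` is a hypothetical helper name.

```python
# Hypothetical 3x3 vertical-edge filter: negative weights on the left,
# positive weights on the right. Values are illustrative only.
FILTER = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def patch_response(patch, weights):
    """Multiply each pixel by the weight at the same position, then sum the products."""
    return sum(
        pixel * weight
        for patch_row, weight_row in zip(patch, weights)
        for pixel, weight in zip(patch_row, weight_row)
    )


# A dark-to-bright vertical edge (0 = dark, 1 = bright).
edge_patch = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
]

# A uniformly bright patch: no edge anywhere.
flat_patch = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
]

edge_score = patch_response(edge_patch, FILTER)  # 3: the filter "lights up"
flat_score = patch_response(flat_patch, FILTER)  # 0: nothing to detect
print(edge_score, flat_score)
```

The filter responds strongly where its pattern is present and gives zero on the flat patch, because the negative and positive weights cancel when every pixel is the same.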

By the end, you’ll be able to

Explain why a fully-connected layer is the wrong tool for images (too many parameters, no use of locality, no weight reuse)
Describe a convolution as a small filter that slides across an image computing a local weighted sum
Compute a convolution by hand and read its response (high where the pattern is present, near zero where absent)
Explain weight-sharing and its two payoffs (far fewer parameters, translation invariance)
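The last two objectives can be previewed with a short sketch in plain Python. It assumes a hypothetical 3x3 edge filter, a toy 3x6 image, and, for the parameter comparison, a 28x28 input feeding a 100-unit fully-connected layer; none of these sizes come from the lesson. As in most deep-learning libraries, the filter is applied without flipping, and biases are ignored.

```python
def convolve2d(image, weights):
    """Slide `weights` over `image` (no padding, stride 1) and return the response map."""
    k = len(weights)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(
                image[r + i][c + j] * weights[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical vertical-edge filter; the same 9 weights are reused at every position.
FILTER = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def image_with_edge(edge_col, height=3, width=6):
    """Dark image (0) that turns bright (1) from column `edge_col` onward."""
    return [[1 if c >= edge_col else 0 for c in range(width)] for _ in range(height)]


left = convolve2d(image_with_edge(2), FILTER)   # [[3, 3, 0, 0]]
right = convolve2d(image_with_edge(4), FILTER)  # [[0, 0, 3, 3]]

# Parameter counts under the assumed sizes (biases ignored).
fc_params = (28 * 28) * 100                     # one weight per pixel per unit
conv_params = len(FILTER) * len(FILTER[0])      # one shared 3x3 filter

print(left, right)
print(fc_params, conv_params)
```

Moving the edge two columns to the right moves the strong responses two columns to the right, with no new weights learned. Under the assumed sizes, the fully-connected layer needs 78,400 weights where the shared filter needs 9.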

Time and difficulty

Read time: about 9 minutes
Practice time: about 10 minutes (a by-hand convolution on fresh patches plus flashcards)
Difficulty: standard (one small, concrete by-hand calculation; otherwise conceptual)