Video understanding: brief

What you’ll learn

This is lesson 9 of 16, the last lesson of Phase 2 (How machines see) before we leave recognition behind for generation in Phase 3. It builds one capability: you will be able to walk the standard architecture ladder for video (from the single-frame baseline up to video transformers), place each landmark architecture (C3D, I3D, two-stream, SlowFast, TimeSformer, ViViT) on it, compute the 2D-vs-3D-conv parameter-cost ratio, and reason about temporal sampling and the single-frame-baseline discipline. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 10 (Video Understanding).

The lesson opens with what a video actually is numerically: a 4D tensor of shape T × H × W × C. A modest clip already holds millions of numbers and a real clip hundreds of millions, which is why temporal sampling is needed. It then walks the architecture ladder from the simplest rungs (single-frame, late fusion, early fusion) through 3D convolutions and two-stream networks, cross-links the CNN+RNN family to lesson 7, and ends with modern video transformers (TimeSformer’s divided attention, ViViT’s factorizations). The parameter-count comparison is worked in both the body and practice: for a 3 × 3 spatial kernel on RGB input, a 2D filter has 27 weights and a 3D filter with temporal extent 3 has 81; at K = 64 filters the layer totals are 1,792 vs 5,248 parameters.
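To make the tensor-size argument concrete, here is a minimal sketch in Python. The clip dimensions (16 frames at 224 × 224, 150 frames at 1280 × 720) and the helper names clip_size and uniform_sample_indices are illustrative assumptions, not figures or code from the lesson, and segment-centre sampling is only one of several reasonable temporal sampling schemes.

```python
def clip_size(frames, height, width, channels=3):
    """Number of scalar values in a T x H x W x C video tensor."""
    return frames * height * width * channels


def uniform_sample_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly over the clip.

    The clip is split into num_samples equal segments and the frame
    nearest the centre of each segment is kept (an assumed scheme).
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    segment = total_frames / num_samples
    return [
        min(total_frames - 1, int(segment * i + segment / 2))
        for i in range(num_samples)
    ]


# Hypothetical "modest" clip: 16 frames of 224 x 224 RGB.
modest = clip_size(16, 224, 224)        # 2,408,448 values

# Hypothetical "real" clip: 5 seconds at 30 fps, 1280 x 720 RGB.
real = clip_size(150, 720, 1280)        # 414,720,000 values

# Sampling 16 of the 150 frames keeps roughly one frame in nine.
indices = uniform_sample_indices(150, 16)

print(modest, real)
print(indices)
```

Under these assumed sizes the modest clip lands in the millions and the real clip in the hundreds of millions, which matches the orders of magnitude stated above. Sampling reduces the number of frames T only; any spatial resizing to bring H and W down is a separate step.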

Where this fits

This is lesson 9 of 16, the fifth and final lesson of Phase 2 (How machines see). It depends on lesson 5 (the conv layer, which 3D conv generalizes) and lesson 7 (sequence tools for vision; the CNN+RNN and video-transformer families build on what was covered there). The next lesson, Learning from images without labels: self-supervised vision, opens Phase 3 (Generating and grounding vision) by asking what a model can learn from unlabelled images.

Before you start

Prerequisites: lesson 7 of this track (sequence tools for vision). The CNN+RNN video architecture is exactly the shape from lesson 7, per-frame CNN features fed into an RNN, applied to video; video transformers are the ViT-style architectures from lesson 7 extended with a temporal dimension.

About the math

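Light. The body cites parameter counts and architectural comparisons but does only one numerical computation: the 2D vs 3D convolution parameter ratio for a 3 × 3 spatial kernel with temporal extent 3 on RGB input (2D = 3 × 3 × 3 = 27 weights per filter, 3D = 3 × 3 × 3 × 3 = 81 weights per filter, exactly 3×). Practice repeats this with K = 64 filters per layer, counting one bias per filter (2D layer = 64 × 27 + 64 = 1,792 params, 3D layer = 64 × 81 + 64 = 5,248 params; ratio ~2.93×, slightly under 3× because the bias terms do not grow with the kernel). No calculus; only multiplication, addition, and division.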
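The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the layer totals assume one bias term per filter, which is what the 1,792 and 5,248 figures imply.

```python
def weights_per_filter(kernel_dims, in_channels):
    """Weights in one conv filter: product of kernel dims times input channels."""
    count = in_channels
    for k in kernel_dims:
        count *= k
    return count


def layer_params(kernel_dims, in_channels, num_filters, bias=True):
    """Total parameters in a conv layer, assuming one bias per filter."""
    per_filter = weights_per_filter(kernel_dims, in_channels)
    if bias:
        per_filter += 1
    return num_filters * per_filter


RGB = 3   # input channels
K = 64    # filters per layer

w2d = weights_per_filter((3, 3), RGB)       # 3 * 3 * 3     = 27
w3d = weights_per_filter((3, 3, 3), RGB)    # 3 * 3 * 3 * 3 = 81

p2d = layer_params((3, 3), RGB, K)          # 64 * (27 + 1) = 1,792
p3d = layer_params((3, 3, 3), RGB, K)       # 64 * (81 + 1) = 5,248

print(w2d, w3d, w3d / w2d)                  # 27 81 3.0
print(p2d, p3d, f"{p3d / p2d:.2f}")         # 1792 5248 2.93
```

The weight-only ratio is exactly 3 because the extra temporal dimension multiplies every filter by 3; the layer ratio drops to about 2.93 because the 64 bias terms are the same in both layers.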

By the end, you’ll be able to

State the video tensor shape and explain why temporal sampling is essential
Explain the single-frame-baseline discipline and what failing it tells you
Walk the architecture ladder and place each landmark on it
Compute and compare 2D vs 3D conv parameter counts
Recognize that the training loop is unchanged across video architectures

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (a 2D-vs-3D-conv parameter-count comparison at K=64, an architecture-matching exercise across the ladder, a single-frame-baseline-defence reasoning question, plus flashcards)
Difficulty: standard (the math is multiplication and ratio; the conceptual lift is holding the architecture ladder in mind and seeing why the single-frame baseline is the experimental hygiene check, not a curiosity)