
Teaching machines to understand video

This is lesson 9, the final lesson of Phase 2 (How machines see) and the last before we leave recognition behind for generation in Phase 3. It builds one capability: you will be able to walk the standard architecture ladder for video (single-frame baseline through video transformer), place each landmark architecture (C3D, I3D, two-stream, SlowFast, TimeSformer, ViViT) on it, compute the 2D-vs-3D-conv parameter-cost ratio, and reason about temporal sampling and the single-frame-baseline discipline. The source curriculum is Stanford CS231n, cs231n.stanford.edu; this lesson maps to Lecture 10 (Video Understanding).

The lesson opens with what a video actually is numerically: a 4D tensor of shape T × H × W × C. A modest clip already holds millions of numbers and a real clip holds hundreds of millions, which is why temporal sampling is needed. It then walks the architecture ladder from the simplest approaches (single-frame, late fusion, early fusion) through 3D convolutions and two-stream networks, cross-links the CNN+RNN family to lesson 7, and ends with modern video transformers (TimeSformer’s divided attention, ViViT’s factorizations). The parameter-count comparison is worked in both the body and practice: with a 3×3 spatial kernel, temporal extent 3, and RGB input, a 2D filter has 27 weights and a 3D filter has 81; with K = 64 filters and one bias per filter, the layers have 1,792 and 5,248 parameters respectively.
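The size claim above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the clip dimensions (16 frames at 224 × 224, and 10 seconds of 30 fps video at 480 × 640) and the helper names `clip_size` and `uniform_sample_indices` are assumptions chosen for this example, not values or functions from the lesson. Uniform sampling is one common temporal-sampling scheme among several.

```python
from math import prod


def clip_size(t, h, w, c=3):
    """Number of values in a T x H x W x C video tensor."""
    return prod((t, h, w, c))


def uniform_sample_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the clip.

    Each index is the centre of one of num_samples equal-length segments.
    """
    if not 0 < num_samples <= num_frames:
        raise ValueError("num_samples must be in 1..num_frames")
    stride = num_frames / num_samples
    return [int(stride * i + stride / 2) for i in range(num_samples)]


# Assumed example sizes (hypothetical, for illustration only).
modest = clip_size(16, 224, 224)    # 16 RGB frames at 224 x 224
real = clip_size(300, 480, 640)     # 10 s at 30 fps, 480 x 640 RGB

# Keep 16 of the 300 frames, spread evenly in time.
indices = uniform_sample_indices(300, 16)
sampled = clip_size(len(indices), 480, 640)

print(f"modest clip:  {modest:,} values")
print(f"real clip:    {real:,} values")
print(f"sampled clip: {sampled:,} values from frames {indices}")
```

Under these assumed sizes, the modest clip lands in the millions and the full clip in the hundreds of millions; keeping 16 evenly spaced frames cuts the full clip by a factor of 300 / 16 before any spatial resizing.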

This is lesson 9 of 16, the fifth and final lesson of Phase 2 (How machines see). It depends on lesson 5 (the conv layer, which 3D conv generalizes) and lesson 7 (sequence tools for vision; the CNN+RNN and video-transformer families build on what was covered there). The next lesson, Learning from images without labels: self-supervised vision, opens Phase 3 (Generating and grounding vision) by asking what a model can learn from unlabelled images.

Prerequisites: lesson 7 of this track (sequence tools for vision). The CNN+RNN video architecture is exactly the per-frame-CNN-features-into-RNN shape from lesson 7, applied to video; video transformers are the ViT-style architectures from lesson 7 extended with a temporal dimension.

Math load: light. The body cites parameter counts and architectural comparisons but does only one numerical computation: the 2D vs 3D convolution parameter ratio for a 3×3 spatial kernel with temporal extent 3 on RGB input (2D = 3·3·3 = 27 weights per filter, 3D = 3·3·3·3 = 81 weights per filter, exactly 3x). Practice repeats it with K = 64 filters per layer and one bias per filter (2D layer = 1,792 params, 3D layer = 5,248 params; ratio ~2.93x, slightly under 3x because the bias terms do not scale with the kernel). No calculus; only multiplication, addition, and division.
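The counts quoted above can be reproduced with a short script. This is a minimal sketch assuming a standard convolution layer with one bias per filter; the helper name `conv_params` is hypothetical and introduced only for this example.

```python
from math import prod


def conv_params(kernel_dims, in_channels, num_filters, bias=True):
    """Return (weights per filter, total layer parameters) for a conv layer.

    kernel_dims is (kh, kw) for 2D conv or (kt, kh, kw) for 3D conv.
    """
    weights_per_filter = prod(kernel_dims) * in_channels
    params_per_filter = weights_per_filter + (1 if bias else 0)
    return weights_per_filter, params_per_filter * num_filters


# 3x3 spatial kernel on RGB input, K = 64 filters.
w2d, layer2d = conv_params((3, 3), in_channels=3, num_filters=64)
# Same spatial kernel with temporal extent 3.
w3d, layer3d = conv_params((3, 3, 3), in_channels=3, num_filters=64)

print(f"2D: {w2d} weights/filter, {layer2d:,} params/layer")
print(f"3D: {w3d} weights/filter, {layer3d:,} params/layer")
print(f"weight ratio: {w3d / w2d:.2f}x, layer ratio: {layer3d / layer2d:.2f}x")
```

The per-filter weight ratio is exactly 3 because the temporal extent multiplies the kernel volume by 3. The layer ratio comes out slightly lower because each filter's single bias is the same in both cases.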

  • State the video tensor shape and explain why temporal sampling is essential
  • Explain the single-frame-baseline discipline and what failing it tells you
  • Walk the architecture ladder and place each landmark on it
  • Compute and compare 2D vs 3D conv parameter counts
  • Recognize that the training loop is unchanged across video architectures
  • Read time: about 13 minutes
  • Practice time: about 15 minutes (a 2D-vs-3D-conv parameter-count comparison at K=64, an architecture-matching exercise across the ladder, a single-frame-baseline-defence reasoning question, plus flashcards)
  • Difficulty: standard (the math is multiplication and ratio; the conceptual lift is holding the architecture ladder in mind and seeing why the single-frame baseline is the experimental hygiene check, not a curiosity)