Cheatsheet: Video understanding

| Property | Detail |
| --- | --- |
| Shape | T × H × W × C (time × height × width × channels) |
| Example: 64-frame 224x224 RGB clip | 64 · 224 · 224 · 3 ≈ 9.6 million numbers |
| Real-world: 1-minute 30fps clip | ~270 million numbers (1,800 frames at 224x224 RGB) |
| Why temporal sampling exists | Raw frame counts are too large; a typical training example uses 32-64 frames |

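
The sizes in the table follow directly from multiplying out T × H × W × C. A minimal sketch of that arithmetic, assuming 224x224 RGB frames throughout (the helper name `clip_size` is illustrative, not from any library):

```python
def clip_size(frames: int, height: int = 224, width: int = 224, channels: int = 3) -> int:
    """Number of scalar values in a T x H x W x C video clip."""
    return frames * height * width * channels


sampled = clip_size(64)          # a 64-frame training clip
raw_minute = clip_size(60 * 30)  # one minute at 30 fps = 1,800 frames

print(f"64-frame clip: {sampled:,} values")
print(f"1-minute clip: {raw_minute:,} values")
print(f"raw / sampled ratio: {raw_minute / sampled:.1f}x")
```

Stored as float32, each value costs 4 bytes, so the memory footprint per example is four times the count printed here.
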

| Strategy | Detail |
| --- | --- |
| Uniform stride | Take every k-th frame |
| Random temporal crops | At training, sample a contiguous chunk (e.g. 64 frames) from a random position |
| Multi-chunk averaging | At evaluation, average predictions over multiple chunks |
| Variable rates per stream | SlowFast: one pathway slow (high spatial detail), one fast (high temporal detail) |
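
The first three strategies reduce to choosing frame indices. The sketch below is one possible way to write them, using only the standard library; the function names and the even spacing of evaluation chunks are assumptions made for illustration, not a reference implementation.

```python
import random


def uniform_stride(num_frames: int, stride: int) -> list[int]:
    """Uniform stride: keep every `stride`-th frame index."""
    return list(range(0, num_frames, stride))


def random_temporal_crop(num_frames: int, clip_len: int, rng: random.Random) -> list[int]:
    """Training: a contiguous chunk of `clip_len` frames at a random start."""
    if clip_len > num_frames:
        raise ValueError("clip_len exceeds the video length")
    start = rng.randint(0, num_frames - clip_len)  # randint is inclusive on both ends
    return list(range(start, start + clip_len))


def eval_chunks(num_frames: int, clip_len: int, num_chunks: int) -> list[list[int]]:
    """Evaluation: `num_chunks` contiguous chunks with evenly spaced start positions."""
    if clip_len > num_frames:
        raise ValueError("clip_len exceeds the video length")
    last_start = num_frames - clip_len
    if num_chunks == 1:
        starts = [last_start // 2]
    else:
        starts = [round(i * last_start / (num_chunks - 1)) for i in range(num_chunks)]
    return [list(range(s, s + clip_len)) for s in starts]


def average_predictions(per_chunk_probs: list[list[float]]) -> list[float]:
    """Multi-chunk averaging: mean class probability across chunks."""
    n = len(per_chunk_probs)
    return [sum(column) / n for column in zip(*per_chunk_probs)]


rng = random.Random(0)
train_idx = random_temporal_crop(1800, 64, rng)  # one training clip from a 1-minute video
chunks = eval_chunks(1800, 64, 3)                # three evaluation clips
avg = average_predictions([[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]])  # made-up per-chunk outputs
```

The per-chunk probabilities in the last line are invented inputs; in practice they would come from running the model on each chunk.
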

Architecture ladder (in increasing temporal-modelling power)

| Approach | What it does | When to use |
| --- | --- | --- |
| Single-frame baseline | 2D classifier on one frame; ignore rest | Always: the floor any video model must beat |
| Late fusion | Per-frame 2D CNN features averaged; classify | Content matters, ORDER doesn’t |
| Early fusion | Stack T frames as input channels before first conv | First layer sees temporal info; deeper layers do not |
| 3D conv | Filter slides spatially AND temporally; ~3x param cost per filter | When motion at multiple scales matters; C3D, I3D |
| Two-stream | RGB appearance stream + optical-flow motion stream; fuse late | When motion is the discriminative signal; SlowFast for modern descendant |
| CNN + RNN | Per-frame 2D CNN features into a sequence model | When output is itself a sequence (video captioning); see lesson 7 |
| Video transformer | Spatio-temporal patches into a transformer encoder; factorized attention | Most-scalable; TimeSformer, ViViT |

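
The reason video transformers factorize attention shows up when counting how many keys each query token has to compare against. The sketch below assumes non-overlapping 16x16 patches, one token per patch per frame, no classification token, and a factorization into a spatial pass (within a frame) followed by a temporal pass (same patch position across frames). The 8-frame setting is an illustrative choice, not a value from any specific model.

```python
def attention_cost(frames: int, height: int = 224, width: int = 224, patch: int = 16) -> dict:
    """Keys each query attends to under joint vs. factorized space-time attention."""
    per_frame = (height // patch) * (width // patch)  # spatial tokens in one frame
    return {
        "tokens": frames * per_frame,
        # Joint attention: every token attends to every space-time token.
        "joint_keys_per_query": frames * per_frame,
        # Factorized: all patches of the same frame, then the same patch in every frame.
        "factorized_keys_per_query": per_frame + frames,
    }


cost = attention_cost(frames=8)
for name, value in cost.items():
    print(f"{name}: {value}")
```

Under these assumptions the joint variant grows with T · N keys per query while the factorized variant grows with N + T, which is why the gap widens as clips get longer.
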

| Layer | Per-filter weights | Per-layer params at K=64 filters (RGB input, weights + biases) |
| --- | --- | --- |
| 2D conv, 3x3 spatial | 3·3·3 = 27 | 64·27 + 64 = 1,792 |
| 3D conv, 3 temporal × 3·3 spatial | 3·3·3·3 = 81 | 64·81 + 64 = 5,248 |
| Ratio | 3x more weights per filter | ~2.93x more layer params |

Compute per output position scales the same way: each filter performs 81 multiply-accumulates per output position instead of 27.
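
The parameter counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming one bias per filter and an RGB (3-channel) input; `conv_params` is an illustrative helper, not a library function.

```python
def conv_params(kernel_dims: tuple, in_channels: int, out_channels: int, bias: bool = True):
    """Return (weights per filter, total layer parameters) for a conv layer."""
    weights_per_filter = in_channels
    for k in kernel_dims:
        weights_per_filter *= k
    total = out_channels * weights_per_filter + (out_channels if bias else 0)
    return weights_per_filter, total


w2d, p2d = conv_params((3, 3), in_channels=3, out_channels=64)     # 3x3 spatial
w3d, p3d = conv_params((3, 3, 3), in_channels=3, out_channels=64)  # 3 temporal x 3x3 spatial

print(f"2D conv: {w2d} weights per filter, {p2d:,} params")
print(f"3D conv: {w3d} weights per filter, {p3d:,} params")
print(f"layer ratio: {p3d / p2d:.2f}x")
```

The layer ratio falls slightly short of 3x because the 64 bias terms are the same in both layers.
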

| Architecture | Year | Headline idea |
| --- | --- | --- |
| C3D | 2015 | Early influential 3D-conv architecture for video |
| I3D | 2017 | “Inflate” pretrained 2D ImageNet weights into 3D filters; bootstrap a 3D model from a strong 2D one |
| Two-stream (Simonyan & Zisserman) | 2014 | Separate RGB-appearance stream and optical-flow-motion stream, fuse late |
| SlowFast | 2019 | Modern two-stream descendant; two pathways at different temporal rates, fused throughout |
| TimeSformer | 2021 | Video transformer with divided space-time attention |
| ViViT | 2021 | Video transformer with factorized encoder / attention / dot-product variants |

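
The I3D "inflation" idea can be illustrated on a single kernel. The sketch below uses the repeat-and-rescale rule usually described for I3D: copy the 2D kernel along the time axis and divide by the temporal size, so that a video made of one repeated frame produces the same response as the original 2D filter on that frame. The kernel and patch values are made up for illustration, and the code handles one channel only.

```python
def inflate_2d_kernel(kernel_2d, time_size):
    """Repeat a 2D kernel `time_size` times along time, scaling weights by 1/time_size."""
    return [[[w / time_size for w in row] for row in kernel_2d] for _ in range(time_size)]


def response_2d(patch, kernel_2d):
    """Dot product of one image patch with a 2D kernel (one output position)."""
    return sum(p * w for p_row, k_row in zip(patch, kernel_2d) for p, w in zip(p_row, k_row))


def response_3d(clip_patch, kernel_3d):
    """Dot product of a T-frame patch with a 3D kernel (one output position)."""
    return sum(response_2d(frame, k2d) for frame, k2d in zip(clip_patch, kernel_3d))


kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]  # illustrative edge filter
patch = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8], [0.3, 0.6, 0.7]]      # illustrative pixel values
static_clip = [patch, patch, patch]                               # the same frame three times

inflated = inflate_2d_kernel(kernel, time_size=3)
r2 = response_2d(patch, kernel)
r3 = response_3d(static_clip, inflated)
print(f"2D response: {r2:.6f}, inflated 3D response on static clip: {r3:.6f}")
```

The two responses agree up to floating-point rounding, which is what lets the inflated network start from the behaviour of the pretrained 2D model.
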

| Dataset | Coverage |
| --- | --- |
| Kinetics-400/600/700 | Modern action-recognition standard (hundreds of thousands of YouTube clips, hundreds of action classes) |
| Charades | Longer multi-action everyday videos |
| AVA | Spatio-temporal action localization (who is doing what, where and when) |
| Something-Something | Fine-grained physical interactions |
| HowTo100M | Instructional videos at scale |


| What | Why |
| --- | --- |
| Loss | Per-clip classification (cross-entropy) or per-step output; same machinery as L3 |
| Gradient descent step | Unchanged |
| Backprop | Carries gradients through 3D convs, RNNs, and transformer blocks identically |

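
For per-clip classification, the loss is ordinary softmax cross-entropy applied to one logit vector per clip. A minimal standard-library sketch, with made-up logits standing in for the output of whatever video model is used:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index `target`."""
    return -math.log(softmax(logits)[target])


clip_logits = [2.0, 0.5, -1.0]  # hypothetical model output for one clip, 3 classes
loss = cross_entropy(clip_logits, target=0)
print(f"per-clip loss: {loss:.4f}")
```

Nothing in this computation depends on whether the logits came from a 3D conv, an RNN, or a transformer, which is the point of the table above.
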

| Pitfall | Reality |
| --- | --- |
| Skip the single-frame baseline | Always run it; many benchmarks are largely solvable from one frame; a video model that fails to beat it is not exploiting temporal information |
| Late fusion = temporal modelling | No; it captures content, not order. Cannot distinguish “reach then grasp” from the reverse |
| Optical flow is free | It must be computed beforehand; modern systems often skip it (learn motion directly from raw frames) |
| Transformer always beats 3D-conv on video | Same trade-off as on images: transformers need more data and compute to outperform; 3D-conv stays competitive at moderate scale |
| Forgetting input scale | 4.8M-9.6M numbers per training example (32-64 frames at 224x224 RGB); memory and compute bills scale accordingly |
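
The late-fusion pitfall is easy to verify: averaging per-frame features is invariant to frame order, so a clip and its reverse produce identical inputs to the classifier. The two-dimensional feature vectors below are hypothetical stand-ins for per-frame CNN features.

```python
def late_fusion(frame_features):
    """Average per-frame feature vectors into one clip-level vector."""
    n = len(frame_features)
    return [sum(column) / n for column in zip(*frame_features)]


reach = [1.0, 0.0]  # hypothetical feature for a "reaching" frame
grasp = [0.0, 1.0]  # hypothetical feature for a "grasping" frame

reach_then_grasp = late_fusion([reach, reach, grasp, grasp])
grasp_then_reach = late_fusion([grasp, grasp, reach, reach])

print(reach_then_grasp, grasp_then_reach)
print("identical clip features:", reach_then_grasp == grasp_then_reach)
```

Any classifier placed on top of these averaged features receives the same vector for both orderings, so it cannot tell them apart.
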

Video adds the time dimension to a 2D vision pipeline; the standard answers (single-frame baseline → late/early fusion → 3D conv → two-stream → CNN+RNN → video transformer) all run the same Phase 1 training loop on top, and temporal sampling is the essential practical step you cannot skip.