References: Video understanding

Source material

This lesson follows Stanford CS231n’s treatment of video understanding (Lecture 10), covering the architecture ladder from single-frame baselines through video transformers.

Course: Stanford CS231n, “Deep Learning for Computer Vision”
Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
Course site: cs231n.stanford.edu
This lesson maps to: Lecture 10 (Video Understanding).

Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.

A note on access and license

The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.

Primary architecture papers (cited by name and venue)

3D convolutions for video

C3D. Tran, Bourdev, Fergus, Torresani, Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks” (ICCV 2015). Influential early 3D-conv architecture for video.
I3D. Carreira, Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset” (CVPR 2017). “Inflated” 3D ConvNet that bootstraps 3D filters from pretrained 2D ImageNet weights. Introduced the Kinetics dataset.

Two-stream and descendants

Two-stream (original). Simonyan, Zisserman, “Two-Stream Convolutional Networks for Action Recognition in Videos” (NeurIPS 2014). The architecture that set the action-recognition standard for several years.
SlowFast. Feichtenhofer, Fan, Malik, He, “SlowFast Networks for Video Recognition” (ICCV 2019). Modern two-pathway descendant with different temporal sampling rates.

Video transformers

TimeSformer. Bertasius, Wang, Torresani, “Is Space-Time Attention All You Need for Video Understanding?” (ICML 2021). Divided space-time attention as a practical factorization.
ViViT. Arnab, Dehghani, Heigold, Sun, Lucic, Schmid, “ViViT: A Video Vision Transformer” (ICCV 2021). Several factorized-encoder and factorized-attention variants.

Datasets

Kinetics-400. Kay, Carreira, Simonyan, Zhang, Hillier, Vijayanarasimhan, Viola, Green, Back, Natsev, Suleyman, Zisserman, “The Kinetics Human Action Video Dataset” (arXiv 2017). Foundational large-scale action-recognition benchmark; later extended to Kinetics-600 and Kinetics-700.
HowTo100M. Miech, Zhukov, Alayrac, Tapaswi, Laptev, Sivic, “HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips” (ICCV 2019). Instructional-video dataset at scale.
AVA. Gu et al., “AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions” (CVPR 2018). Standard benchmark for spatio-temporal action localization: actions are labeled per person, in both space and time.
Something-Something. Goyal et al., “The ‘Something Something’ Video Database for Learning and Evaluating Visual Common Sense” (ICCV 2017). Fine-grained physical interactions, designed to require actual temporal reasoning (a single-frame baseline is weak here, which is the dataset’s point).

Further study

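TorchVision’s video models (torchvision.models.video, PyTorch) and MMAction2 (OpenMMLab) provide maintained implementations behind consistent APIs. MMAction2 covers most of the architectures above; TorchVision ships a smaller set of pretrained video models. Both are recommended starting points for actually running these models.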
Optical flow algorithms that feed two-stream networks: Brox et al. (2004), TV-L1, and modern learned variants like FlowNet (Dosovitskiy et al. 2015) and RAFT (Teed and Deng 2020).
Introduction to Deep Learning (Track 12, Clawdemy). Lesson 2 (Why sequences need memory) covers the generic recurrence machinery that CNN-plus-RNN video architectures use.

How we use this source

Clawdemy follows CS231n’s Lecture 10 architectural ladder (single-frame baseline → late/early fusion → 3D conv → two-stream → CNN+RNN → video transformer) and cites the canonical architecture papers above by name and venue. The 4D video tensor framing (T × H × W × C) is standard across the field. The parameter-count comparison in the body and practice is Clawdemy-authored, derived from the standard conv parameter-count formula in lesson 5: with 3 input channels, a 2D 3×3 filter has 3·3·3 = 27 weights and a 3D 3×3×3 filter has 3·3·3·3 = 81; with K=64 filters and one bias per filter, the layers have 1,792 and 5,248 parameters respectively. The “always run the single-frame baseline” discipline reflects the practitioner consensus on video benchmark interpretation, made explicit here as common-pitfall guidance. We do not reproduce CS231n’s slides, figures, problem sets, or lecture text. Full attribution policy: see Doc/attribution-policy.md.
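
The parameter-count figures quoted above can be checked with a few lines of Python. This is a minimal sketch, not code from CS231n or the lesson body: the helper name conv_layer_params is hypothetical, and it assumes 3 input channels (RGB), a 3×3 (2D) or 3×3×3 (3D) kernel, and one bias per filter, which are the assumptions that reproduce 27 vs 81 and 1,792 vs 5,248.

```python
def conv_layer_params(kernel_dims, in_channels, num_filters, bias=True):
    """Count learnable parameters in a conv layer (hypothetical helper).

    Each filter has prod(kernel_dims) * in_channels weights; the layer has
    num_filters such filters, plus one bias per filter when bias=True.
    Returns (weights_per_filter, total_layer_params).
    """
    weights_per_filter = in_channels
    for k in kernel_dims:
        weights_per_filter *= k
    total = num_filters * weights_per_filter
    if bias:
        total += num_filters
    return weights_per_filter, total


# Assumed setup: RGB input (3 channels), K = 64 filters.
per_filter_2d, layer_2d = conv_layer_params((3, 3), in_channels=3, num_filters=64)
per_filter_3d, layer_3d = conv_layer_params((3, 3, 3), in_channels=3, num_filters=64)

print(per_filter_2d, layer_2d)  # 27 1792
print(per_filter_3d, layer_3d)  # 81 5248
```

Adding a temporal extent of 3 to the kernel triples the weights per filter, while the bias count stays at K, so the layer total grows by slightly less than 3×.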