References: 3D vision

Source material

This lesson follows Stanford CS231n’s treatment of 3D vision (Lecture 15).

Course: Stanford CS231n, “Deep Learning for Computer Vision”
Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
Course site: cs231n.stanford.edu
This lesson maps to: Lecture 15 (3D Vision).

Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.

A note on access and license

The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.

Primary papers (cited by name and venue)

Monocular depth

MiDaS. Ranftl, Lasinger, Hafner, Schindler, Koltun, “Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer” (TPAMI 2020 / arXiv 2019). A widely used general-purpose monocular depth model.
Depth Anything. Yang, Kang, Huang, Xu, Feng, Zhao, “Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data” (CVPR 2024). Foundation-scale monocular depth.
MonoDepth (self-supervised). Godard, Mac Aodha, Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency” (CVPR 2017); MonoDepth2 (ICCV 2019). Influential self-supervised monocular depth.

Structure from Motion and Multi-view Stereo

COLMAP. Schönberger, Frahm, “Structure-from-Motion Revisited” (CVPR 2016). The open-source SfM workhorse.
MVSNet. Yao, Luo, Li, Fang, Quan, “MVSNet: Depth Inference for Unstructured Multi-View Stereo” (ECCV 2018). Early influential deep MVS architecture.

Stereo

PSMNet. Chang, Chen, “Pyramid Stereo Matching Network” (CVPR 2018). Influential deep stereo network.
RAFT-Stereo. Lipson, Teed, Deng, “RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching” (3DV 2021). Modern stereo network.

NeRF and successors

NeRF. Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis” (ECCV 2020; republished in Communications of the ACM, 2021). The paper that triggered the novel-view-synthesis wave.
Instant NGP / Instant Neural Graphics Primitives. Müller, Evans, Schied, Keller, “Instant Neural Graphics Primitives with a Multiresolution Hash Encoding” (SIGGRAPH 2022). A NeRF variant that can train in seconds to minutes rather than hours.
3D Gaussian Splatting. Kerbl, Kopanas, Leimkühler, Drettakis, “3D Gaussian Splatting for Real-Time Radiance Field Rendering” (SIGGRAPH 2023). Best-paper award; a widely adopted method for real-time novel view synthesis.

3D point-cloud detection (for autonomous driving)

PointNet. Qi, Su, Mo, Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation” (CVPR 2017). The original deep architecture for raw point clouds.
PointPillars. Lang, Vora, Caesar, Zhou, Yang, Beijbom, “PointPillars: Fast Encoders for Object Detection from Point Clouds” (CVPR 2019). LiDAR-based 3D object detection workhorse.
CenterPoint. Yin, Zhou, Krähenbühl, “Center-based 3D Object Detection and Tracking” (CVPR 2021). Modern center-based 3D detector.

Further study

COLMAP open-source pipeline (github.com/colmap/colmap): the standard SfM + MVS tool; reproducible 3D reconstruction from a folder of photos.
Open3D (open3d.org): general-purpose 3D data processing library; point-cloud and mesh operations.
Nerfstudio (nerf.studio): a popular open-source toolkit for training NeRF-style models.
TorchVision and PyTorch3D: PyTorch libraries for general computer vision (TorchVision) and for 3D deep learning (PyTorch3D).

How we use this source

Clawdemy follows the ordering of CS231n’s Lecture 15 (depth cues → 3D representations → standard methods) and cites the canonical papers by name and venue. The stereo depth formula Z = (f · b) / d is standard textbook material, where f is the focal length in pixels, b is the baseline between the two cameras, d is the disparity in pixels, and the depth Z comes out in the same unit as b. The worked stereo examples (body: f = 500 px, b = 10 cm, d = 50 px → Z = 100 cm; practice: f = 800 px, b = 12 cm, d = 40 px → Z = 240 cm and d = 10 px → Z = 960 cm) are Clawdemy-authored against the standard formula. The representation-to-task mapping table and the scene-recovery method-choice exercise in the practice section are Clawdemy-authored to make the family-of-methods nature of 3D vision operational rather than abstract. We do not reproduce CS231n’s slides, figures, problem sets, or lecture text. Full attribution policy: see Doc/attribution-policy.md.
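
The worked stereo examples cited above can be checked with a few lines of Python. This is a minimal sketch written for this page, not code from CS231n. The function name stereo_depth and its parameter names are illustrative assumptions, and it assumes the focal length and disparity are given in pixels and the baseline in centimetres.

```python
def stereo_depth(focal_px: float, baseline: float, disparity_px: float) -> float:
    """Return depth Z = (f * b) / d, in the same unit as the baseline.

    focal_px and disparity_px are assumed to be in pixels (hypothetical
    parameter names chosen for this sketch).
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative is invalid.
        raise ValueError("disparity must be positive")
    return focal_px * baseline / disparity_px


# The three Clawdemy-authored examples: (label, f in px, b in cm, d in px).
examples = [
    ("body", 500, 10, 50),
    ("practice, d = 40", 800, 12, 40),
    ("practice, d = 10", 800, 12, 10),
]

for label, f, b, d in examples:
    print(f"{label}: Z = {stereo_depth(f, b, d):.0f} cm")
```

With f and b held fixed, depth scales inversely with disparity, which is why cutting d from 40 to 10 in the practice example multiplies Z by four.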