Recovering 3D vision, in brief

What you’ll learn

This is lesson 13 of the track and sits in Phase 3 (Generating and grounding vision). It builds one capability, reasoning about how vision systems recover 3D structure from 2D images, in four parts: you will be able to explain how that recovery works, compute depth from stereo disparity by hand, distinguish the main 3D representations, and match method to task across monocular depth, multi-view stereo, Structure from Motion, NeRF, and 3D Gaussian Splatting. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 15 (3D Vision).

The lesson opens with the projection problem (cameras collapse 3D into 2D and lose depth) and names the depth cues (stereo, monocular, motion). It then gives the stereo formula Z = (f · b) / d with one worked example and surveys the 3D representations (depth map, point cloud, voxels, mesh, implicit / SDF, NeRF, 3D Gaussian Splatting). Next it walks through the standard methods (monocular depth such as MiDaS; multi-view stereo; Structure from Motion via COLMAP; NeRF; 3D Gaussian Splatting), and it ends with application use cases (self-driving, AR / VR, robotics, photogrammetry, novel view synthesis).
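The projection problem can be made concrete with a minimal sketch. It assumes an ideal pinhole camera with focal length f in pixels, no lens distortion, and the principal point at the image origin; the function name `project` and the sample points are illustrative, not taken from the lecture. Two points on the same ray through the camera centre land on the same pixel, so a single image cannot tell them apart.

```python
def project(point, f=500.0):
    """Project a 3D point (X, Y, Z) in camera coordinates to a pixel (u, v).

    Assumption: ideal pinhole camera, principal point at (0, 0), no distortion.
    """
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Perspective division: this is the step that discards depth.
    return (f * X / Z, f * Y / Z)


near = (0.5, 0.25, 1.0)  # metres, 1 m from the camera
far = (1.5, 0.75, 3.0)   # same ray, three times farther away

print(project(near))  # (250.0, 125.0)
print(project(far))   # (250.0, 125.0)
```

Both calls print the same pixel coordinates, which is why the lesson turns to extra cues (a second camera, motion, learned priors) to get depth back.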

Where this fits

This is lesson 13 of 16, the fourth lesson of Phase 3. It depends on lesson 6 (CNN architectures), because most 3D-vision models use ResNet-family or ViT backbones. The next lesson, Connecting pictures and words: vision and language, returns to 2D images but adds the language modality (CLIP, captioning, visual question answering, and the vision-language foundation models that power multimodal AI).

Before you start

Prerequisites: lesson 6 of this track (CNN architectures). 3D-vision methods sit on standard vision backbones, so you need that picture in mind. Earlier lessons on convolution (L5), self-supervised learning (L10, since NeRF and similar methods use learned representations of scenes), and the generative-modeling stretch (L11-L12, since NeRF training has a reconstruction-loss flavour) are all useful background but not strictly required.

About the math

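Light. The body works one stereo-depth calculation by hand using Z = (f · b) / d with f = 500 pixels, b = 10 cm, d = 50 pixels → Z = (500 · 10) / 50 = 100 cm. Practice repeats the calculation at two fresh disparity values (d = 40 → Z = 240 cm; d = 10 → Z = 960 cm, answers that correspond to f · b = 9600 pixel·cm rather than the body's 5000) to land the inverse relationship between disparity and depth: quartering the disparity quadruples the depth. No calculus; only multiplication and division.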
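The hand calculation can be checked with a few lines of Python. This is a minimal sketch, assuming a rectified stereo pair with focal length in pixels and baseline in centimetres; the function name `stereo_depth` is illustrative, and the second call uses a hypothetical disparity value that is not from the lesson.

```python
def stereo_depth(focal_px, baseline_cm, disparity_px):
    """Depth Z in cm from Z = (f * b) / d.

    Assumption: rectified stereo pair, so disparity is a horizontal pixel shift.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_cm / disparity_px


# Worked example from the lesson: f = 500 px, b = 10 cm, d = 50 px.
print(stereo_depth(500, 10, 50))  # 100.0

# Hypothetical extra value, same camera: halving disparity doubles depth.
print(stereo_depth(500, 10, 25))  # 200.0
```

The guard on non-positive disparity is a design choice for the sketch: as disparity approaches zero the computed depth grows without bound, so a zero or negative value is treated as invalid input rather than returned as a number.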

By the end, you’ll be able to

Explain why 3D vision exists and what depth cues vision systems use
Compute stereo depth from disparity by hand
Distinguish the main 3D representations and pick the right one for a task
Match the right method (monocular depth, MVS, SfM, NeRF, 3DGS) to a given 3D-vision problem

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (a fresh stereo-depth computation at two disparity values, a representation-choice exercise across 4 tasks, a scene-recovery method-choice planning question, plus flashcards)
Difficulty: standard (the math is multiplication and division; the conceptual lift is holding the family-of-methods nature of 3D vision in mind and knowing how to choose among methods by what input you have)