3D vision: cheatsheet

The core problem

Reality	Camera output	Need to recover
3D world (X, Y, Z)	2D image (u, v, color)	Depth Z per pixel (or richer 3D structure)

Depth cues a model can use

Cue	Source
Stereo disparity	Two cameras with known baseline; same point shifts position between views
Monocular: perspective	Parallel lines converge with distance; objects shrink
Monocular: occlusion	Closer objects block farther
Monocular: shading/shadows	Surface orientation produces predictable lighting
Monocular: familiar size	Object of known real-world size; the smaller it appears in the image, the farther away it is
Motion	Parallax from camera motion over time; nearer points shift more in the image than farther ones
Active sensing (out of scope)	LIDAR, structured light, time-of-flight
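
The familiar-size cue in the table above can be made concrete with the pinhole model: an object of real height H that spans h pixels, seen with focal length f (in pixels), lies at roughly Z = f · H / h. Below is a minimal sketch, assuming an ideal pinhole camera and an object facing the camera; the function name and the numbers are illustrative and not from the source.

```python
def distance_from_known_size(focal_px: float, real_height_m: float,
                             image_height_px: float) -> float:
    """Pinhole estimate Z = f * H / h for an object of known real height.

    Assumes the object is roughly parallel to the image plane.
    """
    if image_height_px <= 0:
        raise ValueError("image height must be positive")
    return focal_px * real_height_m / image_height_px


# Hypothetical numbers: a 1.7 m tall person spanning 170 px, with f = 800 px.
z_m = distance_from_known_size(800.0, 1.7, 170.0)
print(f"estimated distance: {z_m:.2f} m")
```

Halving the apparent height doubles the estimated distance, which is why the cue only works when the real size is known.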

Stereo depth formula

Z = (f · b) / d

Symbol	Meaning
Z	Depth (distance from the camera plane along the optical axis; same unit as b)
f	Focal length (in pixel units)
b	Baseline (distance between the two camera centers)
d	Disparity (horizontal pixel shift of the same point between the two rectified images)

Relationship: depth is inversely proportional to disparity. Larger disparity = closer point; a point at infinity has zero disparity.
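
The formula follows from similar triangles in a rectified pinhole pair. Here is a short derivation, assuming the left camera sits at the origin, the right camera is shifted by b along the x-axis, and x_L, x_R are the image x-coordinates of the same 3D point (X, Y, Z) measured from each principal point. The symbols x_L and x_R are introduced here for illustration.

```latex
\begin{align*}
x_L &= \frac{f\,X}{Z}, \qquad x_R = \frac{f\,(X - b)}{Z} \\
d &= x_L - x_R = \frac{f\,b}{Z} \\
Z &= \frac{f\,b}{d}
\end{align*}
```

As Z grows without bound, d tends to zero, which is the zero-disparity case noted above.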

Worked stereo

Source	f	b	d	Z
Body	500 px	10 cm	50 px	100 cm
Practice 1	800 px	12 cm	40 px	240 cm
Practice 2	800 px	12 cm	10 px	960 cm

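Practice 1 and 2 share the same f·b product: 800 × 12 = 9600 cm·px (the Body row has 500 × 10 = 5000 cm·px). Z scales inversely with d: cutting d from 40 px to 10 px quadruples Z from 240 cm to 960 cm.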
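
The rows of the worked table can be checked with a few lines of Python. This is a minimal sketch, assuming a rectified stereo pair; the function name and the choice to return infinity for zero disparity are illustrative conventions rather than anything fixed by the source.

```python
import math


def depth_from_disparity(focal_px: float, baseline_cm: float,
                         disparity_px: float) -> float:
    """Z = f * b / d, in the same unit as the baseline (cm here)."""
    if disparity_px < 0:
        raise ValueError("disparity must be non-negative for a rectified pair")
    if disparity_px == 0:
        return math.inf  # point at infinity
    return focal_px * baseline_cm / disparity_px


# (label, f in px, b in cm, d in px), copied from the worked table.
rows = [
    ("Body", 500, 10, 50),
    ("Practice 1", 800, 12, 40),
    ("Practice 2", 800, 12, 10),
]
for name, f, b, d in rows:
    print(f"{name}: Z = {depth_from_disparity(f, b, d):.0f} cm")
```

Keeping f in pixels and b in centimetres means the pixel units cancel and Z comes out in centimetres.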

3D representations

Representation	Detail	Best for
Depth map	Per-pixel depth, same shape as image	First-stage; AR distance estimation
Point cloud	Set of 3D points (often + color)	LIDAR / RGB-D output; SfM intermediate
Voxels	3D grid (occupied/empty or values)	3D convolutions; memory grows cubically with resolution
Mesh	Vertices + faces (triangles)	Graphics, games, 3D printing
Implicit / SDF	Function mapping a 3D point to occupancy or signed distance	Flexible, resolution-independent geometry; surface is the zero level set of the signed distance
NeRF	MLP `(x,y,z,view direction) → (color, density)`	Novel view synthesis from photos
3D Gaussian Splatting	Collection of 3D Gaussians; rasterize	Real-time novel view synthesis (state of the art at publication, 2023)
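
Several of these representations convert into one another. The sketch below back-projects a depth map into a point cloud with the pinhole model and then bins the points into a boolean voxel grid. It assumes known intrinsics (fx, fy, cx, cy) and depth measured along the optical axis; the function names and the toy values are illustrative only.

```python
import numpy as np


def depth_to_points(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth map to an (N, 3) point cloud (camera frame)."""
    h, w = depth.shape
    # u is the column index, v is the row index; both have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = np.isfinite(depth) & (depth > 0)  # drop missing or invalid depths
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def voxelize(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Boolean occupancy grid covering the bounding box of the points."""
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    grid = np.zeros(tuple(idx.max(axis=0) + 1), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid


# Toy input: a 4x4 depth map of a flat wall 2 units away (hypothetical values).
depth = np.full((4, 4), 2.0)
points = depth_to_points(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
grid = voxelize(points, voxel_size=1.0)
print(points.shape, grid.shape, int(grid.sum()))
```

The point cloud keeps one point per valid pixel, while the voxel grid trades that precision for a regular structure whose size depends on the chosen voxel_size.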

Standard methods

Method	Input	Output
Monocular depth (MiDaS, Depth Anything)	Single image	Depth map
Multi-view stereo (MVS)	Multiple images, known poses	Dense 3D reconstruction
Structure from Motion (COLMAP)	Many photos, unknown poses	Camera poses + sparse 3D points
NeRF (Mildenhall 2020)	Many photos + known poses	Implicit scene model; novel view synthesis
3D Gaussian Splatting (Kerbl 2023)	Many photos + known poses	Explicit Gaussian-set scene model; real-time novel views
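
MVS and SfM both rest on triangulating a 3D point from its projections in views whose poses are known or estimated. Below is a minimal sketch of linear (DLT) triangulation from two views. The intrinsics K, the camera poses, and the test point are made-up values chosen to match a 12 cm baseline; real pipelines such as COLMAP add feature matching, outlier rejection, and bundle adjustment on top of this step.

```python
import numpy as np


def project(P: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Project a 3D point X with a 3x4 projection matrix P to pixel (u, v)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate(P1: np.ndarray, P2: np.ndarray,
                uv1: np.ndarray, uv2: np.ndarray) -> np.ndarray:
    """Linear (DLT) triangulation of one 3D point from two views."""
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


# Hypothetical setup: f = 800 px, second camera 0.12 m to the right of the first.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.12], [0.0], [0.0]])])

X_true = np.array([0.3, -0.1, 2.4])  # metres, in the first camera's frame
uv1, uv2 = project(P1, X_true), project(P2, X_true)
X_est = triangulate(P1, P2, uv1, uv2)
print("disparity (px):", uv1[0] - uv2[0])
print("recovered point:", X_est)
```

With noise-free projections the recovered point matches the true one up to floating-point error; with real, noisy matches the linear solution is usually only a starting point for refinement.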

Application map

Application	Method
Self-driving (camera-based)	Monocular depth + 3D detection; bird’s-eye-view perception
AR / VR	Real-time monocular depth + plane detection; SLAM for camera tracking
Robotics manipulation	Stereo or RGB-D point cloud + grasp planning
Photogrammetry / phone 3D scanning	SfM + MVS + mesh
Real-estate / e-commerce 3D	NeRF / 3DGS for novel views; photogrammetry for explicit models
Movie / game digital doubles	Multi-camera photogrammetry; NeRF for environments

Pitfalls

Pitfall	Reality
Treating 3D vision as a single problem	Family of problems (depth, MVS, novel view synthesis, 3D detection), each with its own architectures, evaluation metrics, and trade-offs
Picking the wrong representation	Depth maps can’t show back sides or occluded surfaces; meshes are hard to learn directly; point clouds lack surface connectivity; NeRF is slow for extracting explicit geometry. Match the representation to the task
Disparity = depth directly	Inversely related; larger disparity = closer point
NeRF replaces everything	Great at novel view synthesis; not always best for explicit geometry, fast queries, or game-engine integration

One-line takeaway

The camera lost the third dimension when it projected; 3D vision recovers it using stereo geometry, monocular cues, or motion, with method and representation chosen by what you have (one image, many images, video, LIDAR) and what you need (per-pixel depth, 3D model, novel views).