| Reality | Camera output | Need to recover |
|---|---|---|
| 3D world (X, Y, Z) | 2D image (u, v, color) | Depth Z per pixel (or richer 3D structure) |
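
The loss of depth in the row above can be illustrated with a minimal sketch. It assumes an ideal pinhole camera (no lens distortion); the function name `project` and the intrinsics `f`, `cx`, `cy` are illustrative values, not taken from the source. Two points on the same viewing ray land on the same pixel, so the image alone cannot tell them apart.

```python
import numpy as np

def project(point_xyz, f=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates (u, v).

    f, cx, cy are assumed intrinsics (focal length and principal point, in pixels).
    """
    X, Y, Z = point_xyz
    return np.array([f * X / Z + cx, f * Y / Z + cy])

near = np.array([0.2, 0.1, 1.0])  # a point 1 unit in front of the camera
far = 5.0 * near                  # same viewing ray, 5 units away

# Both points project to the same pixel: Z is divided out by the projection.
print(project(near), project(far))
```

Because every point along a ray maps to one pixel, recovering Z needs extra information such as a second view, motion, or learned priors.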

| Cue | Source |
|---|---|
| Stereo disparity | Two cameras with known baseline; same point shifts position between views |
| Monocular: perspective | Parallel lines converge with distance; objects shrink |
| Monocular: occlusion | Closer objects block farther |
| Monocular: shading/shadows | Surface orientation produces predictable lighting |
| Monocular: familiar size | Known real-world object size plus its apparent size in the image implies distance |
| Motion | Camera-motion-induced parallax over time |
| Active sensing (out of scope) | LIDAR, structured light, time-of-flight |
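
As one concrete instance of the cues above, the familiar-size cue can be sketched under a pinhole-camera assumption: an object of known real height H that spans h pixels lies at roughly Z = f · H / h. The function name and the example numbers below are hypothetical, chosen only for illustration.

```python
def depth_from_familiar_size(focal_px, real_height, image_height_px):
    """Estimate distance from a known object size (pinhole-camera assumption).

    The result is in the same unit as real_height.
    """
    if image_height_px <= 0:
        raise ValueError("image_height_px must be positive")
    return focal_px * real_height / image_height_px

# Hypothetical example: a 1.7 m tall person spanning 170 px, focal length 800 px.
distance_m = depth_from_familiar_size(800.0, 1.7, 170.0)
print(distance_m)
```

The estimate is only as good as the size prior: if the assumed real height is off by 10%, so is the distance.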

Z = (f · b) / d

| Symbol | Meaning |
|---|---|
| Z | Depth (distance to camera plane) |
| f | Focal length (in pixel units) |
| b | Baseline (distance between cameras) |
| d | Disparity (pixel difference in image position) |

Relationship: depth is inversely proportional to disparity. A larger disparity means a closer point; a point at infinity has zero disparity.
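
A minimal NumPy sketch of the formula Z = (f · b) / d follows. It assumes a rectified stereo pair, so that disparity is a purely horizontal pixel shift; the function name `depth_from_disparity` is illustrative. Zero disparity is mapped to infinite depth, matching the relationship stated above.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline):
    """Convert disparities (pixels) to depths, in the same unit as baseline."""
    d = np.asarray(disparity_px, dtype=float)
    depth = np.full(d.shape, np.inf)  # zero disparity -> point at infinity
    valid = d > 0
    depth[valid] = focal_px * baseline / d[valid]
    return depth

# f = 500 px and b = 10 cm; halving the disparity doubles the depth.
disparities = np.array([50.0, 25.0, 0.0])
depths_cm = depth_from_disparity(disparities, focal_px=500.0, baseline=10.0)
print(depths_cm)
```

Treating non-positive disparity as "infinitely far" is a design choice for this sketch; a real pipeline might instead mark such pixels as invalid.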

| Source | f | b | d | Z |
|---|---|---|---|---|
| Body | 500 px | 10 cm | 50 px | 100 cm |
| Practice 1 | 800 px | 12 cm | 40 px | 240 cm |
| Practice 2 | 800 px | 12 cm | 10 px | 960 cm |
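
Practice 1 and Practice 2 share the same f·b product, 9600 cm·px (the Body row has f·b = 5000 cm·px). With f·b fixed, Z scales inversely with d: a 4× smaller disparity gives a 4× larger depth.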
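
For reference, each row of the table follows from substituting its values into Z = (f · b) / d. The subscripts below are labels for the table rows, introduced here for readability:

```latex
\begin{align*}
Z_{\text{body}} &= \frac{500\ \text{px} \cdot 10\ \text{cm}}{50\ \text{px}} = 100\ \text{cm} \\
Z_{1} &= \frac{800\ \text{px} \cdot 12\ \text{cm}}{40\ \text{px}} = \frac{9600}{40}\ \text{cm} = 240\ \text{cm} \\
Z_{2} &= \frac{800\ \text{px} \cdot 12\ \text{cm}}{10\ \text{px}} = \frac{9600}{10}\ \text{cm} = 960\ \text{cm}
\end{align*}
```

The pixel units cancel, so Z comes out in the unit of the baseline.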

| Representation | Detail | Best for |
|---|---|---|
| Depth map | Per-pixel depth, same shape as image | First-stage output for downstream 3D tasks; AR distance estimation |
| Point cloud | Set of 3D points (often + color) | LIDAR / RGB-D output; SfM intermediate |
| Voxels | 3D grid (occupied/empty or values) | 3D conv operations; cubic memory limits |
| Mesh | Vertices + faces (triangles) | Graphics, games, 3D printing |
| Implicit / SDF | Function mapping a 3D point → occupancy or signed distance | Flexible modern representation; the surface is the zero level set of the SDF |
| NeRF | MLP (x,y,z,view direction) → (color, density) | Novel view synthesis from photos |
| 3D Gaussian Splatting | Collection of 3D Gaussians, rendered by rasterization | Real-time novel view synthesis (state of the art when published in 2023) |
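
To make the relationship between some of these representations concrete, here is a hedged sketch that converts a depth map into a point cloud and then into a voxel occupancy grid. It assumes a pinhole camera with intrinsics `f`, `cx`, `cy`; the function names and the toy 4×6 depth map are illustrative, not from the source.

```python
import numpy as np

def depth_map_to_point_cloud(depth, f, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) camera-frame point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    valid = np.isfinite(depth) & (depth > 0)        # skip missing or infinite depth
    z = depth[valid]
    x = (u[valid] - cx) * z / f
    y = (v[valid] - cy) * z / f
    return np.column_stack([x, y, z])

def voxelize(points, voxel_size, grid_shape):
    """Mark each voxel that contains at least one point (expects a non-empty cloud)."""
    grid = np.zeros(grid_shape, dtype=bool)
    idx = np.floor((points - points.min(axis=0)) / voxel_size).astype(int)
    inside = np.all(idx < np.array(grid_shape), axis=1)  # drop points beyond the grid
    grid[tuple(idx[inside].T)] = True
    return grid

# Toy example: a flat wall 2 units away, seen by a 4x6-pixel camera.
depth = np.full((4, 6), 2.0)
cloud = depth_map_to_point_cloud(depth, f=4.0, cx=2.5, cy=1.5)
grid = voxelize(cloud, voxel_size=0.5, grid_shape=(8, 8, 8))
print(cloud.shape, int(grid.sum()), grid.size)
```

The grid here already holds 512 cells for 24 points; doubling the resolution per axis multiplies the cell count by eight, which is the cubic memory limit noted in the table.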

| Method | Input | Output |
|---|---|---|
| Monocular depth (MiDaS, Depth Anything) | Single image | Depth map |
| Multi-view stereo (MVS) | Multiple images, known poses | Dense 3D reconstruction |
| Structure from Motion (COLMAP) | Many photos, unknown poses | Camera poses + sparse 3D points |
| NeRF (Mildenhall 2020) | Many photos + known poses | Implicit scene model; novel view synthesis |
| 3D Gaussian Splatting (Kerbl 2023) | Many photos + known poses | Explicit Gaussian-set scene model; real-time novel views |
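
To show how an implicit model's (color, density) outputs become a pixel, the sketch below alpha-composites samples along a single ray using the discrete volume-rendering sum described in the NeRF paper. The densities and colors are hand-picked stand-ins for what a trained MLP would return, and the function name `composite_ray` is an assumption made for illustration.

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, ordered front to back.

    densities: (N,) volume density at each sample
    colors:    (N, 3) RGB at each sample
    deltas:    (N,) distance between consecutive samples
    """
    densities = np.asarray(densities, dtype=float)
    colors = np.asarray(colors, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    alpha = 1.0 - np.exp(-densities * deltas)  # opacity of each ray segment
    # Transmittance: fraction of light that reaches sample i unblocked.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = transmittance * alpha
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Empty space, then an opaque green surface, then a blue sample hidden behind it.
rgb, weights = composite_ray(
    densities=[0.0, 1000.0, 5.0],
    colors=[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    deltas=[0.1, 0.1, 0.1],
)
print(rgb, weights)
```

The pixel comes out green because the opaque sample absorbs the ray before it reaches the blue one. The same weights can also give an expected depth along the ray, which is one way geometry is read out of such a model.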

| Application | Method |
|---|---|
| Self-driving (camera-based) | Monocular depth + 3D detection; bird’s-eye-view perception |
| AR / VR | Real-time monocular depth + plane detection; SLAM for camera tracking |
| Robotics manipulation | Stereo or RGB-D point cloud + grasp planning |
| Photogrammetry / phone 3D scanning | SfM + MVS + mesh |
| Real-estate / e-commerce 3D | NeRF / 3DGS for novel views; photogrammetry for explicit models |
| Movie / game digital doubles | Multi-camera photogrammetry; NeRF for environments |

| Pitfall | Reality |
|---|---|
| Treating 3D vision as a single problem | It is a family of problems (depth, MVS, novel view synthesis, 3D detection), each with its own architectures, evaluation metrics, and trade-offs |
| Picking the wrong representation | Depth maps can’t show back sides; meshes are hard to learn; point clouds lack surface connectivity; NeRF is slow to yield explicit geometry. Match the representation to the task |
| Disparity = depth directly | Inversely related; larger disparity = closer point |
| NeRF replaces everything | Great at novel view synthesis; not always best for explicit geometry, fast queries, or game-engine integration |

A camera discards the third dimension when it projects the scene onto an image. 3D vision recovers it using stereo geometry, monocular cues, or motion, with the method and representation chosen by what you have (one image, many images, video, LIDAR) and what you need (per-pixel depth, a 3D model, novel views).