Cheatsheet: 3D vision

| Reality | Camera output | Need to recover |
| --- | --- | --- |
| 3D world (X, Y, Z) | 2D image (u, v, color) | Depth Z per pixel (or richer 3D structure) |


| Cue | Source |
| --- | --- |
| Stereo disparity | Two cameras with a known baseline; the same point shifts position between views |
| Monocular: perspective | Parallel lines converge with distance; objects shrink |
| Monocular: occlusion | Closer objects block farther ones |
| Monocular: shading/shadows | Surface orientation produces predictable lighting |
| Monocular: familiar size | Known real-world object size; apparent size in the image implies distance |
| Motion | Parallax induced by camera motion over time |
| Active sensing (out of scope) | LIDAR, structured light, time-of-flight |

Z = (f · b) / d

| Symbol | Meaning |
| --- | --- |
| Z | Depth (distance to camera plane) |
| f | Focal length (in pixel units) |
| b | Baseline (distance between cameras) |
| d | Disparity (pixel difference in image position) |

Relationship: larger disparity means a closer point; a point at infinity has zero disparity.
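
The formula follows from similar triangles in a rectified stereo pair. A short derivation, assuming a pinhole model with the left camera at the origin and the right camera shifted by b along the x-axis; x_L and x_R are names introduced here for the horizontal image coordinates of the same 3D point (X, Y, Z):

```latex
\begin{aligned}
x_L &= \frac{f X}{Z}, \qquad x_R = \frac{f (X - b)}{Z} \\
d &= x_L - x_R = \frac{f b}{Z} \\
Z &= \frac{f b}{d}
\end{aligned}
```

As Z grows without bound, d tends to 0, which is the zero-disparity limit for a point at infinity.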

| Source | f | b | d | Z |
| --- | --- | --- | --- | --- |
| Body | 500 px | 10 cm | 50 px | 100 cm |
| Practice 1 | 800 px | 12 cm | 40 px | 240 cm |
| Practice 2 | 800 px | 12 cm | 10 px | 960 cm |

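Both practice rows share the same product f·b = 9600 cm·px (the body example has f·b = 5000 cm·px). With f·b fixed, Z scales inversely with d: a quarter of the disparity gives four times the depth.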
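
The worked rows can be checked with a few lines of Python. This is a minimal sketch, not a library API: the function name `depth_from_disparity` is illustrative, and treating non-positive disparity as infinite depth is an assumption made here for convenience.

```python
def depth_from_disparity(f_px: float, baseline: float, disparity_px: float) -> float:
    """Return depth Z = f * b / d, in the same length unit as the baseline."""
    if disparity_px <= 0:
        # Assumption of this sketch: zero (or invalid negative) disparity
        # is reported as a point at infinity.
        return float("inf")
    return f_px * baseline / disparity_px


# Rows from the worked examples: (label, f in px, b in cm, d in px)
rows = [
    ("Body", 500, 10, 50),
    ("Practice 1", 800, 12, 40),
    ("Practice 2", 800, 12, 10),
]

for label, f_px, b_cm, d_px in rows:
    z_cm = depth_from_disparity(f_px, b_cm, d_px)
    print(f"{label}: Z = {z_cm:.0f} cm")
```

Because f is in pixels and d is in pixels, the pixel units cancel and Z comes out in whatever unit the baseline uses (centimetres here).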

| Representation | Detail | Best for |
| --- | --- | --- |
| Depth map | Per-pixel depth, same shape as image | First-stage output; AR distance estimation |
| Point cloud | Set of 3D points (often + color) | LIDAR / RGB-D output; SfM intermediate |
| Voxels | 3D grid (occupied/empty or values) | 3D conv operations; limited by cubic memory growth |
| Mesh | Vertices + faces (triangles) | Graphics, games, 3D printing |
| Implicit / SDF | Function: 3D point → occupancy or signed distance | Flexible modern representation; surface is the zero level set |
| NeRF | MLP: (x, y, z, view direction) → (color, density) | Novel view synthesis from photos |
| 3D Gaussian Splatting | Collection of 3D Gaussians, rendered by rasterization | Real-time novel view synthesis (current SOTA) |

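
Representations convert into one another. As an illustration, a depth map becomes a point cloud by back-projecting each pixel through the pinhole model. A minimal pure-Python sketch follows; the principal point `cx`, `cy`, the function name, and the "0 means missing depth" convention are assumptions introduced here, not part of the cheatsheet.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_map_to_point_cloud(
    depth: List[List[float]], f_px: float, cx: float, cy: float
) -> List[Point]:
    """Back-project each pixel (u, v) with depth Z to a 3D point (X, Y, Z)."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                # Skip pixels with no valid depth (assumed convention: 0 = missing).
                continue
            x = (u - cx) * z / f_px  # X = (u - cx) * Z / f
            y = (v - cy) * z / f_px  # Y = (v - cy) * Z / f
            points.append((x, y, z))
    return points


# Tiny 2x3 depth map in cm; 0 marks a missing measurement.
depth = [
    [100.0, 100.0, 0.0],
    [240.0, 240.0, 240.0],
]
cloud = depth_map_to_point_cloud(depth, f_px=500.0, cx=1.0, cy=0.5)
print(len(cloud))  # one point per valid pixel
print(cloud[0])
```

The reverse direction loses information: projecting a point cloud into a single view keeps only the nearest surface per pixel, which is why a depth map cannot show backsides.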

| Method | Input | Output |
| --- | --- | --- |
| Monocular depth (MiDaS, Depth Anything) | Single image | Depth map |
| Multi-view stereo (MVS) | Multiple images, known poses | Dense 3D reconstruction |
| Structure from Motion (COLMAP) | Many photos, unknown poses | Camera poses + sparse 3D points |
| NeRF (Mildenhall et al. 2020) | Many photos + known poses | Implicit scene model; novel view synthesis |
| 3D Gaussian Splatting (Kerbl et al. 2023) | Many photos + known poses | Explicit Gaussian-set scene model; real-time novel views |

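
To make the NeRF row concrete: the per-sample (color, density) outputs become a pixel through volume rendering along a camera ray. Below is a minimal sketch of the discrete front-to-back compositing rule, with hand-written sample values standing in for MLP outputs. The function name `composite_ray` and the sample data are illustrative and do not come from any library.

```python
import math
from typing import List, Tuple

Color = Tuple[float, float, float]


def composite_ray(
    colors: List[Color], densities: List[float], deltas: List[float]
) -> Color:
    """Alpha-composite samples along one ray, front to back.

    alpha_i = 1 - exp(-sigma_i * delta_i); weight_i = T_i * alpha_i,
    where T_i is the transmittance accumulated before sample i.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for color, sigma, delta in zip(colors, densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
    return (pixel[0], pixel[1], pixel[2])


# Three samples: empty space, a dense red surface, then a blue one behind it.
colors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
densities = [0.0, 50.0, 50.0]
deltas = [0.1, 0.1, 0.1]
red, green, blue = composite_ray(colors, densities, deltas)
print(f"red={red:.3f} green={green:.3f} blue={blue:.3f}")
```

The red surface absorbs almost all of the ray, so the blue sample behind it contributes very little. This is how occlusion falls out of the density field without any explicit surface.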

| Application | Method |
| --- | --- |
| Self-driving (camera-based) | Monocular depth + 3D detection; bird’s-eye-view perception |
| AR / VR | Real-time monocular depth + plane detection; SLAM for camera tracking |
| Robotics manipulation | Stereo or RGB-D point cloud + grasp planning |
| Photogrammetry / phone 3D scanning | SfM + MVS + mesh |
| Real-estate / e-commerce 3D | NeRF / 3DGS for novel views; photogrammetry for explicit models |
| Movie / game digital doubles | Multi-camera photogrammetry; NeRF for environments |


| Pitfall | Reality |
| --- | --- |
| Treating 3D vision as a single problem | It is a family of problems (depth, MVS, novel view synthesis, 3D detection), each with different architectures, evaluation, and trade-offs |
| Picking the wrong representation | Depth maps can’t show backsides; meshes are hard to learn; point clouds lack an explicit surface; NeRF is slow for explicit geometry. Match the representation to the task |
| Assuming disparity = depth directly | They are inversely related; larger disparity means a closer point |
| Assuming NeRF replaces everything | Great at novel view synthesis; not always best for explicit geometry, fast queries, or game-engine integration |
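
The disparity pitfall is easy to see numerically. A short sketch, reusing f·b = 9600 cm·px from the practice rows, shows how the same one-pixel disparity error costs far more depth at long range. The helper name `depth_cm` is illustrative.

```python
F_TIMES_B = 800 * 12  # f (px) * b (cm) from the practice rows = 9600 cm*px


def depth_cm(disparity_px: float) -> float:
    """Stereo depth Z = f * b / d for the fixed rig above."""
    return F_TIMES_B / disparity_px


for d_px in (40, 10):
    z_true = depth_cm(d_px)
    z_off = depth_cm(d_px - 1)  # disparity underestimated by one pixel
    print(f"d={d_px} px: Z={z_true:.0f} cm, one-pixel error adds {z_off - z_true:.1f} cm")
```

Depth error grows roughly with the square of Z for a fixed disparity error, so a stereo rig of a given baseline is most reliable at close range.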

The camera lost the third dimension when it projected; 3D vision recovers it using stereo geometry, monocular cues, or motion, with method and representation chosen by what you have (one image, many images, video, LIDAR) and what you need (per-pixel depth, 3D model, novel views).