Summary: 3D vision

Cameras project a 3D scene to a 2D image; depth is lost in that projection, and 3D vision is the family of methods that recovers it. Depth cues a model can use: stereo disparity (two cameras), monocular cues (perspective, occlusion, shading, familiar size, texture, learned priors from training data), and temporal cues (parallax from camera motion through a scene). Standard recovery formula for rectified binocular stereo: Z = (f · b) / d (depth = focal length × baseline / disparity). 3D representations vary by task: depth maps, point clouds, voxels, meshes, implicit functions / SDFs, and the more recent NeRF and 3D Gaussian Splatting scene representations. Methods include monocular depth estimation (MiDaS), multi-view stereo, Structure from Motion (COLMAP), NeRF (novel view synthesis via a learned MLP), and 3D Gaussian Splatting (widely regarded as the current real-time state of the art for novel views).

Core ideas

The projection problem. A camera projects 3D → 2D and loses one dimension. Real interaction (self-driving, AR, robotics) needs that depth back; image-based recovery uses stereo, monocular cues (perspective, occlusion, shading, familiar size, texture), or motion (camera-motion-induced parallax).
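To make the lost dimension concrete, here is a minimal sketch of pinhole projection. The function name `project` and the intrinsics (focal length 500 pixels, principal point at (320, 240)) are illustrative assumptions, not values from the lesson. Two points on the same viewing ray, one three times farther than the other, land on the same pixel, so the image alone cannot tell them apart.

```python
def project(point_3d, f, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to a pixel (u, v).

    f is the focal length in pixels; (cx, cy) is the principal point.
    All values used below are illustrative assumptions.
    """
    X, Y, Z = point_3d
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Dividing by Z is the step that discards depth.
    return (f * X / Z + cx, f * Y / Z + cy)


f, cx, cy = 500.0, 320.0, 240.0
near = (0.5, 0.25, 1.0)
far = (1.5, 0.75, 3.0)  # same viewing ray, three times farther away

print(project(near, f, cx, cy))  # (570.0, 365.0)
print(project(far, f, cx, cy))   # (570.0, 365.0)
```

Every point along that ray maps to the same (u, v); stereo, monocular cues, and motion are different ways of deciding which point on the ray was actually seen.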
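Stereo formula. Z = (f · b) / d, with f and d in pixels and Z in the same units as b. Worked example: f = 500 pixels, b = 10 cm, d = 50 pixels → Z = 100 cm. Closer points give larger disparity; points at infinity give zero disparity. Further practice: f = 800 pixels, b = 12 cm, d = 40 pixels → Z = 240 cm; with d = 10 pixels → Z = 960 cm. Depth is inversely proportional to disparity: quartering the disparity quadruples the depth.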
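The worked numbers above can be checked with a few lines of code. This is a small sketch assuming a rectified stereo pair; the function name `stereo_depth` and its handling of zero disparity are choices made for illustration.

```python
import math


def stereo_depth(focal_px, baseline, disparity_px):
    """Depth from disparity for a rectified stereo pair: Z = f * b / d.

    focal_px and disparity_px are in pixels; the result is in the
    same units as baseline.
    """
    if disparity_px < 0:
        raise ValueError("disparity must be non-negative")
    if disparity_px == 0:
        return math.inf  # zero disparity corresponds to a point at infinity
    return focal_px * baseline / disparity_px


print(stereo_depth(500, 10, 50))  # 100.0 (cm)
print(stereo_depth(800, 12, 40))  # 240.0 (cm)
print(stereo_depth(800, 12, 10))  # 960.0 (cm)
```

Because d sits in the denominator, a one-pixel disparity error matters far more for distant points (small d) than for close ones, which is why stereo depth gets noisier with range.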
3D representations. Depth map (per-pixel depth; the gateway concept for the others). Point cloud (unordered set of 3D points; typical output of LIDAR, RGB-D sensors, and SfM). Voxels (3D grid; memory grows cubically with resolution). Mesh (vertices + faces; the graphics standard). Implicit/SDF (function from a 3D point to occupancy or signed distance). NeRF/3DGS (learned scene representations for novel view synthesis).
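Representations convert into one another. As one example, a depth map plus camera intrinsics gives a point cloud by inverting the pinhole projection. The sketch below assumes a pinhole camera with a single focal length f in pixels and principal point (cx, cy), and treats non-positive depth as "no measurement"; the function name and the tiny test image are hypothetical.

```python
import numpy as np


def depth_to_point_cloud(depth, f, cx, cy):
    """Back-project a depth map (H x W) into an (N, 3) point cloud.

    Assumes a pinhole camera; pixels with depth <= 0 are treated as
    missing and dropped.
    """
    h, w = depth.shape
    v, u = np.indices((h, w))  # v = row index, u = column index
    z = depth
    x = (u - cx) * z / f       # invert u = f * X / Z + cx
    y = (v - cy) * z / f       # invert v = f * Y / Z + cy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]


# A 2 x 3 depth map, 2 m everywhere except one missing pixel.
depth = np.array([[0.0, 2.0, 2.0],
                  [2.0, 2.0, 2.0]])
cloud = depth_to_point_cloud(depth, f=500.0, cx=1.0, cy=0.5)
print(cloud.shape)  # (5, 3)
```

The reverse direction (point cloud to depth map) is the projection itself, which is why the depth map is a convenient starting point for the other representations.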
Methods. Monocular depth (MiDaS and successors; train a network to read monocular cues). MVS (images with known poses → dense 3D). SfM (photos with unknown poses → jointly recover camera poses AND 3D structure; COLMAP). NeRF (Mildenhall et al., 2020; an MLP maps (x, y, z, view direction) to (color, density); images are rendered by integrating color and density along each camera ray; best suited to novel view synthesis). 3D Gaussian Splatting (Kerbl et al., 2023; a scene of 3D Gaussians rendered by efficient rasterization; faster than NeRF at comparable quality, and widely regarded as the current real-time state of the art).
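The volumetric rendering step in NeRF can be sketched on a single ray. The code below uses the standard discretization of the rendering integral (per-sample opacity from density, accumulated transmittance, weighted color sum); the densities and colors are hand-written stand-ins for what a trained MLP would output, and the function name `composite_ray` is hypothetical.

```python
import numpy as np


def composite_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: (N,) volume density at each sample
    colors:    (N, 3) RGB at each sample
    deltas:    (N,) distance between consecutive samples
    Returns the rendered RGB and the per-sample weights.
    """
    # Opacity of each segment: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches sample i unblocked.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    return weights @ colors, weights


# Stand-in values: empty space, then a dense red sample, then a blue one.
densities = np.array([0.0, 50.0, 50.0])
colors = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
deltas = np.full(3, 0.1)

rgb, weights = composite_ray(densities, colors, deltas)
print(rgb)      # dominated by red: the blue sample is almost fully occluded
print(weights)  # first weight is 0 because that sample has zero density
```

Because every operation here is differentiable, the rendered color can be compared with the real pixel and the error pushed back into the network, which is how a NeRF is trained from posed photos.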
Applications. Self-driving 3D perception, AR/VR scene understanding, robotics manipulation/navigation, photogrammetry (mobile 3D scanning, real estate, archaeology), movie/game production (digital doubles, environment capture, pre-vis), medical 3D reconstruction. The shift from specialized hardware (LIDAR required) to image-based software is what carried 3D vision from labs to phones over the last decade.

What changes for you

When a phone AR app places a virtual lamp on the floor in real time, that is typically monocular depth + plane detection running on-device. When a 3D-scanning app turns photos of a sculpture into a printable mesh, that is SfM + MVS + mesh extraction. When a self-driving display shows 3D bounding boxes around pedestrians, that is camera-based or LIDAR-based 3D detection. When a “render this scene from a different angle” demo appears, it is almost certainly NeRF or 3D Gaussian Splatting. The right tool for a 3D task depends on what you have (one image, two images, many images, video, LIDAR) and what you need (per-pixel depth, a 3D model, novel views). The economic point: dense 3D capture used to need specialized hardware; the last decade’s progress made it doable from ordinary cameras, which is why 3D-vision applications now appear in consumer products rather than only in specialized installations.

The camera's projection discards the third dimension; the algorithms in this lesson are the standard ways of recovering it.