Summary: Detection, segmentation, and visualization

Classification asks “what is in this image?” Real-world vision often needs more. Detection asks “what AND where?” and produces one (class, bounding box) pair per object. Segmentation asks “which pixels belong to which object?” and labels every pixel. Visualization asks “what is the network actually looking at?” and produces heatmaps and embeddings that show what a trained CNN has learned. This lesson covers all three at the level needed for production decisions: which architectures to use, which metrics to compute, and what each technique can and cannot tell you.

Core ideas

Object detection outputs a list of (class label, bounding box) pairs, not a single label. Two families: two-stage (R-CNN → Fast R-CNN → Faster R-CNN; propose regions, then classify them; historically more accurate but slower) and one-stage (YOLO, SSD, RetinaNet; a single forward pass over a dense grid; historically faster). Most classic members of both families use anchor boxes (predicting offsets from pre-defined box shapes), though anchor-free detectors also exist. Both train with a classification + box-regression loss and are evaluated by matching predictions to ground truth with IoU (intersection-over-union), with the results summarized as mean Average Precision (mAP).
IoU = intersection area / union area, between 0 (no overlap) and 1 (perfect overlap). The standard match threshold is 0.5; stricter benchmarks (0.75, or an average across thresholds) exist because 0.5 is loose. Worked in the body, with boxes given as (x1, y1, x2, y2) corners: predicted box (0,0,10,10) and ground-truth (1,1,11,11) each have area 100 → intersection 9 × 9 = 81, union 100 + 100 − 81 = 119, IoU ≈ 0.681 (a match at the 0.5 threshold).
Semantic segmentation labels each pixel with a class but ignores instances (three cats all get “cat” pixels). Architectures: FCN (replace the fully connected layers with conv layers so the output is a spatial map), U-Net (encoder-decoder with skip connections; the medical-imaging standard).
Instance segmentation labels each pixel with class AND instance ID (three cats produce three separate masks). Architecture: Mask R-CNN (Faster R-CNN + per-region mask head; the standard).
Visualization techniques peek inside trained CNNs. Saliency (gradient of the class score with respect to the input pixels; cheap). Occlusion sensitivity (slide a mask over the image and watch the prediction drop; slow but faithful). Grad-CAM (feature maps of a late conv layer, weighted by the gradient of the class score; the most-used “where is the model looking” technique). DeepDream / activation maximization (gradient-ascend an input to maximize a neuron’s activation). t-SNE / UMAP (project deep features to 2D to see how the network clusters classes). All are useful for intuition and debugging; none is a complete explanation. Explainable AI (XAI) is an active research area; treat visualizations as a useful first look, not as ground truth.
The training loop is unchanged. Each task defines its own loss (classification + box regression for detection; per-pixel cross-entropy for segmentation; combinations for instance segmentation). Lesson 3’s gradient descent + lesson 4’s backprop run on top of any architecture.
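
The IoU worked example in the list above can be checked with a few lines of plain Python. This is a minimal sketch, not a reference implementation: the function names `box_area` and `iou` are illustrative, and the (x_min, y_min, x_max, y_max) corner format is an assumption chosen because it reproduces the intersection of 81 and union of 119 quoted above.

```python
def box_area(box):
    """Area of an axis-aligned box in (x_min, y_min, x_max, y_max) format (assumed)."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; clamp to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
ground_truth = (1, 1, 11, 11)
score = iou(predicted, ground_truth)
print(round(score, 3))   # 0.681  (81 / 119)
print(score >= 0.5)      # True: counts as a match at the standard threshold
```

Raising the threshold to 0.75 would turn the same prediction into a miss, which is why the stricter benchmarks are harder to score well on.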

What changes for you

You see all three extensions in production. Detection powers autonomous-vehicle perception, security-camera analytics, retail shopper tracking, face detection in phone cameras, and document-OCR pre-processing. Segmentation powers medical imaging (tumor boundary measurement), photo background removal in consumer apps, AR features (virtual try-on, ground detection), and per-pixel scene parsing in driving stacks. Visualization powers debugging (why did the model fail on this image?), fairness analysis (is the model relying on spurious features that correlate with protected attributes?), and AI-safety research.

If you build or deploy vision systems, IoU is the metric you will compute; a Grad-CAM map is the first thing you will produce to investigate a failed prediction; and the choice among Faster R-CNN, YOLO, and Mask R-CNN is the pragmatic decision you make based on latency, accuracy, and whether you need per-instance masks. The wolf-vs-husky example, in which a classifier keyed on the snowy background rather than the animal, is the canonical warning: visualization can catch spurious-feature reliance that test-set accuracy alone misses.
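
To make the spurious-feature point concrete, here is a toy occlusion-sensitivity sketch in plain Python. Everything in it is hypothetical: `toy_score` stands in for a trained classifier and is deliberately written to depend only on the bright bottom row (the “snow”), and `occlusion_map` is an illustrative helper, not a library function. A real pipeline would call an actual model and use larger patches.

```python
def toy_score(image):
    """Hypothetical stand-in for a classifier's 'wolf' score.

    It looks only at mean brightness of the bottom row (the "snow"),
    mimicking a model that learned a background shortcut.
    """
    bottom = image[-1]
    return sum(bottom) / len(bottom)


def occlusion_map(image, score_fn, patch=1, fill=0.0):
    """Drop in score when each patch-by-patch region is replaced by `fill`."""
    height, width = len(image), len(image[0])
    baseline = score_fn(image)
    drops = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]  # copy so the input is untouched
            for y in rows:
                for x in cols:
                    occluded[y][x] = fill
            drop = baseline - score_fn(occluded)
            for y in rows:
                for x in cols:
                    drops[y][x] = drop
    return drops


image = [
    [0.2, 0.9, 0.2],  # animal pixels
    [0.2, 0.9, 0.2],
    [1.0, 1.0, 1.0],  # snowy background
]
heatmap = occlusion_map(image, toy_score)
for row in heatmap:
    print([round(v, 2) for v in row])
```

Only the bottom row shows a score drop when occluded; hiding the animal changes nothing. That pattern is the kind of evidence that would flag a background shortcut even if the model scored well on a test set where wolves always appear on snow.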

Three phrases capture it: detection adds “where” to “what”; segmentation refines “where” to the pixel level; visualization adds a partial, honest “why” to “what.” All three sit on top of the conv architectures of lessons 5 and 6, with the training loop unchanged.