Beyond “what is it”: detection, segmentation, and seeing inside the net

This is lesson 8, part of Phase 2 (How machines see). The capability it builds: you will be able to distinguish the three task families that extend classification (detection, segmentation, visualization), compute the standard detection-evaluation metric (IoU) by hand, recognize the canonical architecture for each task on sight, and reason about when each visualization technique is the right diagnostic. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 9 (Object Detection, Image Segmentation, Visualizing and Understanding).

The lesson walks through object detection (R-CNN family vs YOLO family; anchor boxes; classification + box-regression loss; IoU + mAP), works one IoU by hand, walks through image segmentation (semantic via FCN / U-Net; instance via Mask R-CNN), then surveys visualization techniques (saliency, occlusion sensitivity, Grad-CAM, DeepDream, t-SNE) with an honest caveat on what they can and cannot tell you. The training loop is unchanged across all three; what changes per task is the loss and the architecture’s output head.
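To make the "classification + box-regression loss" mentioned above concrete, here is a minimal sketch of a two-term loss for a single predicted box. It is an illustration under stated assumptions, not the lesson's or any particular detector's exact formulation: the function names and the box_weight parameter are hypothetical, and smooth L1 is assumed for the box term because it is a common choice in the R-CNN family.

```python
import math


def cross_entropy(class_scores, true_class):
    """Softmax cross-entropy for one box, computed in a numerically stable way."""
    m = max(class_scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in class_scores))
    return log_sum_exp - class_scores[true_class]


def smooth_l1(pred_box, true_box):
    """Smooth L1 summed over box coordinates (assumed box-regression term)."""
    total = 0.0
    for p, t in zip(pred_box, true_box):
        d = abs(p - t)
        # Quadratic near zero, linear for larger errors.
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def detection_loss(class_scores, true_class, pred_box, true_box, box_weight=1.0):
    """Hypothetical combined loss: classification term + weighted box term."""
    return cross_entropy(class_scores, true_class) + box_weight * smooth_l1(pred_box, true_box)


# Example: three class scores, true class 0, and a box that is slightly off.
loss = detection_loss([2.0, 0.5, -1.0], 0, [0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0])
print(loss)
```

The same gradient-descent loop can minimize this value; only the quantity being minimized and the shape of the network's output differ from plain classification.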

This is lesson 8 of 16, the fourth lesson of Phase 2. It depends on lessons 5 and 6 (the conv layer and the CNN architectures these task-specific heads sit on top of). The next lesson, Teaching machines to understand video, extends vision from single images to motion across time, the natural follow-on once spatial tasks are covered.

Prerequisites: lesson 6 of this track (CNN architectures). The detection and segmentation architectures use the conv backbones from lesson 6 as their feature extractors; you need that picture in mind. Lessons 3-4 (loss, gradient descent, backprop) carry over unchanged.

Math load: light. The body works one IoU computation by hand, using IoU = intersection area / union area on a pair of 2D boxes written as corner coordinates (x1, y1, x2, y2): predicted (0,0,10,10) vs ground-truth (1,1,11,11) → 81/119 ≈ 0.681. Practice repeats the calculation with fresh boxes: predicted (2,2,12,12) vs ground-truth (5,5,15,15) → 49/151 ≈ 0.325. No calculus; only addition, subtraction, multiplication, and division.
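The hand calculation above can be checked with a short sketch. It assumes boxes are given as (x1, y1, x2, y2) corners, which is the reading that reproduces 81/119 and 49/151. The function names are hypothetical, and treating "IoU at or above the threshold" as a match is an assumed convention (some evaluations use a strict inequality).

```python
def iou(box_a, box_b):
    """Intersection over union for two boxes in (x1, y1, x2, y2) corner format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap width and height; clamp at zero when the boxes do not overlap.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """Assumed convention: a prediction matches when IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold


# Body example: intersection 9 x 9 = 81, union 100 + 100 - 81 = 119.
print(round(iou((0, 0, 10, 10), (1, 1, 11, 11)), 3), is_match((0, 0, 10, 10), (1, 1, 11, 11)))  # 0.681 True

# Practice example: intersection 7 x 7 = 49, union 100 + 100 - 49 = 151.
print(round(iou((2, 2, 12, 12), (5, 5, 15, 15)), 3), is_match((2, 2, 12, 12), (5, 5, 15, 15)))  # 0.325 False
```

At threshold 0.5 the first pair counts as a match and the second does not, which is the match-or-not decision the lesson asks for.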

  • Distinguish the four task families (classification, detection, semantic segmentation, instance segmentation) by output and by question
  • Compute IoU by hand and decide match-or-not at threshold 0.5
  • Name detection’s two architectural families and their trade-off
  • Name segmentation’s standard architectures (FCN, U-Net, Mask R-CNN) and what each is best at
  • Survey visualization techniques and state the honest caveat about what explainability (XAI) methods can and cannot tell you
  • Read time: about 14 minutes
  • Practice time: about 15 minutes (a fresh IoU computation, an architecture-matching exercise across detection + segmentation + visualization, a visualization-investigation planning question, plus flashcards)
  • Difficulty: standard (the math is integer arithmetic for IoU; the conceptual lift is holding three task families and their architecture-and-evaluation pieces in mind at once)