Cheatsheet: Detection, segmentation, and visualization

| Task | Output | Question answered |
| --- | --- | --- |
| Classification | One class label per image | What is in this image? |
| Detection | List of (class, bounding box) per image | What AND where? |
| Semantic segmentation | Class label per pixel | Which pixels belong to which class? |
| Instance segmentation | Class + instance ID per pixel | Which pixels belong to which specific object? |
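
To make the four output types concrete, here is a small sketch for a hypothetical 2x3 image containing two cats. The class IDs, the (x1, y1, x2, y2) box format, and all variable names are illustrative assumptions, not the output format of any particular library.

```python
# Hypothetical class IDs for illustration: 0 = background, 1 = cat.
BACKGROUND, CAT = 0, 1

# Classification: one class label for the whole image.
classification = CAT

# Detection: a list of (class, bounding box) pairs, boxes as (x1, y1, x2, y2).
detection = [(CAT, (0, 0, 1, 2)), (CAT, (2, 0, 3, 2))]

# Semantic segmentation: one class label per pixel (2 rows x 3 columns).
semantic = [
    [CAT, BACKGROUND, CAT],
    [CAT, BACKGROUND, CAT],
]

# Instance segmentation: class per pixel plus an instance ID per pixel.
# Both cats share class CAT but get different IDs (1 and 2); 0 = no object.
instance_ids = [
    [1, 0, 2],
    [1, 0, 2],
]

# The semantic output cannot tell the two cats apart; the instance output can.
num_cat_pixels = sum(row.count(CAT) for row in semantic)
num_instances = len({i for row in instance_ids for i in row if i != 0})
print(num_cat_pixels, num_instances)  # 4 2
```

The semantic map only says that four pixels are "cat"; the instance IDs are what separate them into two objects.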

| Family | Examples | Idea | Trade-off |
| --- | --- | --- | --- |
| Two-stage | R-CNN, Fast R-CNN, Faster R-CNN | Propose regions, then classify each | Historically more accurate, slower |
| One-stage | YOLO, SSD, RetinaNet | Single forward pass over a dense grid | Faster (often real-time), historically slightly less accurate |

Both families typically use anchor boxes (pre-defined box shapes; the network predicts offsets from them), train with a classification loss plus a box-regression loss, and are evaluated with IoU and mAP.
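
The "network predicts offsets" idea can be sketched as a decoding step. This is a minimal sketch that assumes a Faster R-CNN-style parameterization (center shifts scaled by the anchor size, log-scale width and height); other detectors encode offsets differently, and the function name `decode_box` is made up for this example.

```python
import math

def decode_box(anchor, offsets):
    """Turn an anchor box plus predicted offsets into a final box.

    anchor:  (x1, y1, x2, y2), a pre-defined box.
    offsets: (dx, dy, dw, dh), as a network might predict them.
    Assumes center shifts scaled by anchor size and log-scale width/height.
    """
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + aw / 2, ay1 + ah / 2

    dx, dy, dw, dh = offsets
    cx, cy = acx + dx * aw, acy + dy * ah        # shift the center
    w, h = aw * math.exp(dw), ah * math.exp(dh)  # rescale width and height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

anchor = (0.0, 0.0, 10.0, 10.0)
unchanged = decode_box(anchor, (0.0, 0.0, 0.0, 0.0))  # zero offsets keep the anchor
shifted = decode_box(anchor, (0.1, 0.1, 0.0, 0.0))    # moves the box by 1 unit in x and y
print(unchanged)
print(shifted)
```

Predicting small offsets from a nearby anchor is an easier regression target than predicting absolute coordinates from scratch, which is the usual motivation for anchors.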

| Step | Formula |
| --- | --- |
| Intersection x1 | max(A.x1, B.x1) |
| Intersection y1 | max(A.y1, B.y1) |
| Intersection x2 | min(A.x2, B.x2) |
| Intersection y2 | min(A.y2, B.y2) |
| Intersection area | (x2 - x1) * (y2 - y1) if both differences are positive; otherwise 0 |
| Union area | area(A) + area(B) - intersection area |
| IoU | intersection area / union area |

Match threshold: typically 0.5 (loose); stricter benchmarks use 0.75 or sweep a range of thresholds for mAP (COCO, for example, averages over 0.50 to 0.95 in steps of 0.05).

| Box | Coords | Area |
| --- | --- | --- |
| A (predicted) | (0, 0, 10, 10) | 100 |
| B (ground truth) | (1, 1, 11, 11) | 100 |
| Intersection | (1, 1, 10, 10) | 81 |
| Union | - | 100 + 100 - 81 = 119 |
| IoU | - | 81 / 119 ≈ 0.681 (match at 0.5 threshold) |
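
The formula steps and the worked example can be checked with a few lines of Python. This is a minimal sketch assuming boxes are (x1, y1, x2, y2) tuples with x2 > x1 and y2 > y1; the function name `iou` is chosen for this example only.

```python
def iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = ix2 - ix1, iy2 - iy1
    # No overlap if either side of the intersection is non-positive.
    inter = iw * ih if iw > 0 and ih > 0 else 0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pred = (0, 0, 10, 10)    # box A (predicted)
truth = (1, 1, 11, 11)   # box B (ground truth)
score = iou(pred, truth)
print(round(score, 3))                  # 0.681
print(score >= 0.5, score >= 0.75)      # True False
print(iou((0, 0, 1, 1), (5, 5, 6, 6)))  # 0.0 for boxes that do not overlap
```

The same prediction counts as a match at the 0.5 threshold and as a miss at 0.75, which is why the threshold always has to be reported alongside the score.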

| Architecture | What it does | When to use |
| --- | --- | --- |
| FCN (Long 2015) | Replaces the fully connected layers with convolutions so the output is spatial; semantic segmentation | First deep-learning approach; a baseline |
| U-Net (Ronneberger 2015) | Encoder-decoder with skip connections from each encoder layer to its decoder counterpart | Medical imaging standard; clean spatial detail |
| Mask R-CNN (He 2017) | Faster R-CNN + per-region binary mask head; instance segmentation | When you need per-instance, per-pixel masks |
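
Why U-Net's skip connections preserve spatial detail can be seen in a toy 1D sketch. It is a deliberately stripped-down illustration: there are no learned convolutions, "channels" are plain Python lists, and the helpers `downsample` and `upsample` are made up for this example. Only the resolution bookkeeping resembles a real U-Net.

```python
def downsample(channels):
    """Halve resolution with max pooling over neighbouring pairs."""
    return [[max(c[i], c[i + 1]) for i in range(0, len(c), 2)] for c in channels]

def upsample(channels):
    """Double resolution by repeating each value (nearest neighbour)."""
    return [[v for v in c for _ in range(2)] for c in channels]

# One channel, 8 positions, with a sharp edge starting at position 3.
x = [[0, 0, 0, 9, 9, 9, 9, 9]]

enc1 = x                        # encoder, full resolution (8)
enc2 = downsample(enc1)         # encoder, half resolution (4)
bottleneck = downsample(enc2)   # quarter resolution (2)

up2 = upsample(bottleneck)      # decoder, back to 4
dec2 = up2 + enc2               # skip connection: concatenate along channels
up1 = upsample(dec2)            # decoder, back to 8
dec1 = up1 + enc1               # skip connection at full resolution

print(up1[0])   # [9, 9, 9, 9, 9, 9, 9, 9]  edge lost in the bottleneck
print(up1[1])   # [0, 0, 9, 9, 9, 9, 9, 9]  edge recovered, but off by one position
print(dec1[2])  # [0, 0, 0, 9, 9, 9, 9, 9]  exact edge, carried by the skip
```

The upsampled paths alone place the edge incorrectly or lose it; the channel concatenated from the encoder carries the exact location back to the decoder.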

| Technique | What it does | Cost |
| --- | --- | --- |
| Saliency | Gradient of the class score with respect to the input pixels | Cheap (one backward pass) |
| Occlusion sensitivity | Slide a mask over the image, record where the prediction drops | Slow (many forward passes); most faithful |
| Grad-CAM | Feature maps from a late convolutional layer, weighted by the gradient of the class score | Cheap; works on any CNN; most used in practice |
| Activation maximization / DeepDream | Gradient ascent on the input to maximize a neuron’s activation | Moderate; striking visuals |
| t-SNE / UMAP | Project deep features to 2D | Moderate; needs features from many images |
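
Occlusion sensitivity is simple enough to sketch without a framework. The snippet below assumes a hypothetical `toy_score` function standing in for a real classifier's class score; it looks only at the 2x2 center of a 4x4 image, so occluding the center should hurt the most.

```python
def toy_score(image):
    """Hypothetical stand-in for a classifier's class score.

    Averages the 2x2 center of a 4x4 image and ignores everything else.
    """
    return sum(image[r][c] for r in (1, 2) for c in (1, 2)) / 4.0

def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Slide a patch x patch mask over the image and record the score drop."""
    h, w = len(image), len(image[0])
    base = score_fn(image)
    drops = []
    for top in range(h - patch + 1):
        row = []
        for left in range(w - patch + 1):
            occluded = [list(r) for r in image]  # copy so the original stays intact
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    occluded[r][c] = fill
            row.append(base - score_fn(occluded))  # one forward pass per position
        drops.append(row)
    return drops

image = [[1.0] * 4 for _ in range(4)]
drops = occlusion_map(image, toy_score)
for row in drops:
    print(row)
# [0.25, 0.5, 0.25]
# [0.5, 1.0, 0.5]
# [0.25, 0.5, 0.25]
```

Even this tiny example needs nine extra forward passes for a 4x4 image, which is where the "slow" in the table comes from; the cost grows with image size and shrinks with stride.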

All useful for intuition + debugging; NONE is a complete explanation. XAI is active research.

| What | Why |
| --- | --- |
| Training loop | L3 loss + L4 backprop run over any architecture |
| Gradient descent step | Unchanged; just a new loss + new architecture per task |
| Backprop | Carries gradients through detection heads, segmentation heads, mask heads identically |
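
The point that only the loss and the model change can be illustrated with one loop reused for two toy losses. This is a sketch under loud assumptions: a finite-difference gradient stands in for backprop, the "models" are bare parameter lists, and both loss functions are hypothetical.

```python
def numeric_grad(loss_fn, params, eps=1e-6):
    """Finite-difference gradient; a stand-in for backprop in this sketch."""
    base = loss_fn(params)
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grads.append((loss_fn(bumped) - base) / eps)
    return grads

def train(loss_fn, params, lr=0.1, steps=200):
    """The same loop for every task: compute gradients, take a descent step."""
    for _ in range(steps):
        grads = numeric_grad(loss_fn, params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Two hypothetical task losses; only the loss changes, not the loop.
def box_regression_loss(p):
    """Squared error to a target box corner at (3, 4)."""
    return (p[0] - 3.0) ** 2 + (p[1] - 4.0) ** 2

def pixel_loss(p):
    """Squared error to a target per-pixel value of 1.0."""
    return (p[0] - 1.0) ** 2

box_params = train(box_regression_loss, [0.0, 0.0])
pixel_params = train(pixel_loss, [0.0])
print(box_params)    # should land close to [3.0, 4.0]
print(pixel_params)  # should land close to [1.0]
```

Swapping `box_regression_loss` for `pixel_loss` required no change to `train`, which mirrors the table: new head and new loss, same descent step.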

| Pitfall | Reality |
| --- | --- |
| IoU 0.5 = pass, IoU 0.49 = fail (sharp) | IoU is continuous; 0.5 is convention; visually-fine predictions can fall below |
| Semantic = instance segmentation | Semantic: per-pixel class only. Instance: + per-instance ID. Different output, different architecture |
| Saliency / Grad-CAM = ground truth on why | Local approximations; useful first looks, not proof |
| One-stage always faster than two-stage | Usually, but depends on backbone, image size, framework, hardware |

Detection adds “where” to “what” (list of class+box per image, IoU evaluation); segmentation refines “where” to per-pixel (semantic vs instance, U-Net / Mask R-CNN); visualization adds partial “why” (saliency / Grad-CAM / t-SNE) as a debugging tool, not a complete explanation.