Detection, segmentation: cheatsheet

The three task families

Task	Output	Question answered
Classification	One class label per image	What is in this image?
Detection	List of (class, bounding box) per image	What AND where?
Semantic segmentation	Class label per pixel	Which pixels belong to which class?
Instance segmentation	Class + instance ID per pixel	Which pixels belong to which specific object?
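
To make the four output types concrete, here is a minimal sketch on a made-up 3x2 "image" holding two cats. The labels, pixel layout, and the helper names `to_semantic` and `count_objects` are assumptions for illustration, not part of any library.

```python
# Toy image, 3 pixels wide and 2 tall: cat | background | cat.
# All values below are invented for illustration.

# Classification: one label for the whole image.
classification = "cat"

# Detection: list of (class, bounding box), box as (x1, y1, x2, y2).
detection = [("cat", (0, 0, 1, 2)), ("cat", (2, 0, 3, 2))]

# Semantic segmentation: one class label per pixel; both cats share a label.
semantic = [
    ["cat", "bg", "cat"],
    ["cat", "bg", "cat"],
]

# Instance segmentation: class + instance ID per pixel; the cats are told apart.
instance = [
    [("cat", 1), ("bg", 0), ("cat", 2)],
    [("cat", 1), ("bg", 0), ("cat", 2)],
]


def to_semantic(instance_mask):
    """Drop the instance IDs, leaving only the per-pixel class."""
    return [[cls for (cls, _id) in row] for row in instance_mask]


def count_objects(instance_mask):
    """Count distinct (class, ID) pairs that are not background."""
    return len({px for row in instance_mask for px in row if px[0] != "bg"})


print(to_semantic(instance) == semantic)  # True: instance output contains the semantic one
print(count_objects(instance))            # 2: something the semantic mask alone cannot tell you
```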

Detection architectures

Family	Examples	Idea	Trade-off
Two-stage	R-CNN, Fast R-CNN, Faster R-CNN	Propose regions then classify each	Historically more accurate, slower
One-stage	YOLO, SSD, RetinaNet	Single forward pass over a dense grid	Faster (often real-time), historically slightly less accurate

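Many detectors in both families (e.g. Faster R-CNN, SSD, RetinaNet, YOLOv2) use anchor boxes (pre-defined shapes; the network predicts offsets). Both families train with a classification + box-regression loss and are evaluated with IoU + mAP.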
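
"Network predicts offsets" can be sketched as follows. This assumes the Faster R-CNN-style parameterization (center shift scaled by anchor size, log-scale width and height); other detectors encode offsets differently, and the function name `decode_anchor` and the numbers are illustrative only.

```python
import math


def decode_anchor(anchor, offsets):
    """Turn predicted offsets into a box (assumed Faster R-CNN-style encoding).

    anchor:  (cx, cy, w, h) pre-defined box in center format
    offsets: (tx, ty, tw, th) values the network would predict
    returns: (x1, y1, x2, y2) in corner format
    """
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = offsets
    cx = acx + tx * aw       # shift the center by a fraction of the anchor width
    cy = acy + ty * ah       # shift the center by a fraction of the anchor height
    w = aw * math.exp(tw)    # scale the width; exp keeps it positive
    h = ah * math.exp(th)    # scale the height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


anchor = (50.0, 50.0, 20.0, 40.0)
print(decode_anchor(anchor, (0.0, 0.0, 0.0, 0.0)))  # (40.0, 30.0, 60.0, 70.0): zero offsets return the anchor
print(decode_anchor(anchor, (0.5, 0.0, 0.0, 0.0)))  # (50.0, 30.0, 70.0, 70.0): shifted right by half the anchor width
```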

IoU (Intersection over Union)

Step	Formula
Intersection x1	`max(A.x1, B.x1)`
Intersection y1	`max(A.y1, B.y1)`
Intersection x2	`min(A.x2, B.x2)`
Intersection y2	`min(A.y2, B.y2)`
Intersection area	`(x2 - x1) * (y2 - y1)` (if both differences are positive; else 0, because the boxes do not overlap)
Union area	`area(A) + area(B) - intersection area`
IoU	`intersection area / union area`

Match threshold: typically 0.5 (loose); stricter benchmarks use 0.75 or average over a sweep of thresholds (COCO mAP uses 0.50 to 0.95 in steps of 0.05).

Worked IoU example

Box	Coords	Area
A (predicted)	(0, 0, 10, 10)	100
B (ground truth)	(1, 1, 11, 11)	100
Intersection	(1, 1, 10, 10)	81
Union		100 + 100 - 81 = 119
IoU		81 / 119 ≈ 0.681 (match at 0.5 threshold)
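
The IoU steps and the worked numbers can be checked with a short function. This is a minimal sketch assuming boxes are `(x1, y1, x2, y2)` tuples with `x1 < x2` and `y1 < y2`; the function name `iou` is chosen here for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at 0 so non-overlapping boxes get zero intersection area.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


pred = (0, 0, 10, 10)   # box A (predicted)
gt = (1, 1, 11, 11)     # box B (ground truth)
score = iou(pred, gt)
print(round(score, 3))              # 0.681, i.e. 81 / 119
print(score >= 0.5, score >= 0.75)  # True False: a match at 0.5 but not at 0.75
```

The clamp with `max(0, ...)` plays the role of the "else 0" case in the table: without it, two disjoint boxes could produce a positive product from two negative differences.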

Segmentation architectures

Architecture	What it does	When to use
FCN (Long 2015)	Replaces the fully connected layers with convolutions so the output is a spatial map; semantic segmentation	First fully convolutional deep approach; baseline
U-Net (Ronneberger 2015)	Encoder-decoder with skip connections from each encoder stage to its decoder counterpart	Standard in medical imaging; preserves fine spatial detail
Mask R-CNN (He 2017)	Faster R-CNN + per-region binary mask head; instance segmentation	When you need per-instance per-pixel masks
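
The encoder-decoder-with-skips idea can be traced by following tensor shapes alone. The sketch below is not a real network: it assumes same-padded convolutions, 2x downsampling and upsampling per stage, and made-up channel counts, and the function name `unet_shapes` is hypothetical.

```python
def unet_shapes(in_shape=(3, 128, 128), base_channels=16, depth=3):
    """Trace (channels, height, width) through a toy U-Net-style network.

    Assumes height and width are divisible by 2**depth. Returns a list of
    (name, shape) pairs in forward order.
    """
    _, h, w = in_shape
    trace, skips = [], []
    ch = base_channels

    # Encoder: a conv block sets the channel count, then the map is halved.
    for stage in range(depth):
        skips.append((ch, h, w))              # saved for the matching decoder stage
        trace.append((f"enc{stage}", (ch, h, w)))
        h, w = h // 2, w // 2
        ch *= 2

    c = ch
    trace.append(("bottleneck", (c, h, w)))

    # Decoder: upsample, concatenate the saved encoder map, conv back down.
    for stage in reversed(range(depth)):
        h, w = h * 2, w * 2
        sc, sh, sw = skips[stage]
        assert (sh, sw) == (h, w), "skip and decoder maps must align spatially"
        trace.append((f"dec{stage} concat", (c // 2 + sc, h, w)))
        c = sc
        trace.append((f"dec{stage}", (c, h, w)))
    return trace


for name, shape in unet_shapes():
    print(name, shape)
```

The last entry has the same height and width as the input, which is what lets a final 1x1 convolution emit one class score per pixel; the concat rows show where encoder detail re-enters the decoder.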

Visualization techniques

Technique	What it does	Cost
Saliency	Gradient of class score wrt input pixels	Cheap (one backward pass)
Occlusion sensitivity	Slide a mask over the input, record where the prediction drops	Slow (many forward passes); measures the effect on the prediction directly
Grad-CAM	Weights late-layer conv feature maps by the gradient of the class score	Cheap; works on most CNNs without retraining; most-used in practice
Activation max / DeepDream	Gradient ascent on the input to maximize a neuron’s activation	Moderate; striking visuals
t-SNE / UMAP	Project deep features to 2D	Moderate; needs features from many images

All are useful for intuition + debugging; NONE is a complete explanation. Explainable AI (XAI) is an active research area.
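
Occlusion sensitivity is the easiest of these to sketch without a framework. The example below assumes a stand-in `toy_score` function in place of a real model's class score, a non-overlapping 2x2 patch, and a fill value of 0; all names and numbers are hypothetical.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Slide a patch x patch mask over the image and record the score drop.

    image:    2D list of floats
    score_fn: stand-in for a model's class score, image -> float
    returns:  2D list, one drop value per patch position (stride = patch)
    """
    h, w = len(image), len(image[0])
    base = score_fn(image)
    drops = []
    for top in range(0, h - patch + 1, patch):
        row = []
        for left in range(0, w - patch + 1, patch):
            occluded = [r[:] for r in image]  # copy so the original stays intact
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    occluded[y][x] = fill
            row.append(base - score_fn(occluded))  # one "forward pass" per position
        drops.append(row)
    return drops


def toy_score(img):
    """Toy 'model': class score is the mean brightness of the top-left 2x2 corner."""
    return (img[0][0] + img[0][1] + img[1][0] + img[1][1]) / 4


image = [
    [1.0, 1.0, 0.2, 0.2],
    [1.0, 1.0, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2],
]
print(occlusion_map(image, toy_score))  # [[1.0, 0.0], [0.0, 0.0]]
```

Only the top-left patch changes the score, so the map points at the region the toy model actually reads. The cost is one score evaluation per patch position, which is why the method is slow on real models.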

What does NOT change

What	Why
Training loop	L3 loss + L4 backprop run over any architecture
Gradient descent step	Unchanged; just a new loss + new architecture per task
Backprop	Carries gradients through detection heads, segmentation heads, mask heads identically

Pitfalls

Pitfall	Reality
IoU 0.5 = pass, IoU 0.49 = fail (sharp)	IoU is continuous; 0.5 is convention; visually-fine predictions can fall below
Semantic = instance segmentation	Semantic: per-pixel class only. Instance: + per-instance ID. Different output, different architecture
Saliency / Grad-CAM = ground truth on why	Local approximations; useful first looks, not proof
One-stage always faster than two-stage	Usually, but depends on backbone, image size, framework, hardware

One-line takeaway

Detection adds “where” to “what” (list of class+box per image, IoU evaluation); segmentation refines “where” to per-pixel (semantic vs instance, U-Net / Mask R-CNN); visualization adds partial “why” (saliency / Grad-CAM / t-SNE) as a debugging tool, not a complete explanation.