| Task | Output | Question answered |
|---|---|---|
| Classification | One class label per image | What is in this image? |
| Detection | List of (class, bounding box) per image | What AND where? |
| Semantic segmentation | Class label per pixel | Which pixels belong to which class? |
| Instance segmentation | Class + instance ID per pixel | Which pixels belong to which specific object? |
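To make the difference between the last two rows concrete, here is a minimal sketch that collapses an instance-segmentation output into a semantic one. The tiny image, the `"cat"` labels, and the helper name `to_semantic` are illustrative assumptions, not part of any particular library.

```python
# Instance segmentation output for a tiny 3x4 image:
# 0 = background, any other integer = instance ID.
instance_ids = [
    [0, 1, 1, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
]

# Hypothetical class label for each instance ID (two separate cats).
instance_class = {1: "cat", 2: "cat"}


def to_semantic(instance_ids, instance_class, background="background"):
    """Drop the instance IDs and keep only the class label per pixel."""
    return [
        [instance_class.get(i, background) for i in row]
        for row in instance_ids
    ]


semantic = to_semantic(instance_ids, instance_class)

# Two distinct objects in the instance output...
num_instances = len({i for row in instance_ids for i in row if i != 0})
# ...but only one foreground class in the semantic output.
foreground_classes = {c for row in semantic for c in row if c != "background"}
```

Going from instance to semantic output is a lossy lookup; the reverse direction is not possible without extra information, which is why the two tasks need different model outputs.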

| Family | Examples | Idea | Trade-off |
|---|---|---|---|
| Two-stage | R-CNN, Fast R-CNN, Faster R-CNN | Propose regions then classify each | Historically more accurate, slower |
| One-stage | YOLO, SSD, RetinaNet | Single forward pass over a dense grid | Faster (often real-time), historically slightly less accurate |

Both families commonly use anchor boxes (predefined box shapes; the network predicts offsets from them), train with a classification loss plus a box-regression loss, and are evaluated with IoU and mAP.
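A minimal sketch of how predicted offsets turn an anchor into a box. The text above does not say how offsets are encoded, so this assumes the center/size parameterization popularized by Faster R-CNN style detectors (center shifts scaled by anchor size, log-space width and height). The function name `decode_box` is hypothetical.

```python
import math


def decode_box(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (x1, y1, x2, y2).

    Assumed encoding (not specified by the source):
      center = anchor_center + t * anchor_size
      size   = anchor_size * exp(t)
    """
    ax1, ay1, ax2, ay2 = anchor
    tx, ty, tw, th = offsets

    # Anchor width, height, and center.
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah

    # Shift the center and rescale the size.
    cx = acx + tx * aw
    cy = acy + ty * ah
    w = aw * math.exp(tw)
    h = ah * math.exp(th)

    # Convert back to corner format.
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


# Zero offsets return the anchor itself; a small center shift moves the box.
unchanged = decode_box((0, 0, 10, 10), (0.0, 0.0, 0.0, 0.0))
shifted = decode_box((0, 0, 10, 10), (0.1, 0.1, 0.0, 0.0))
```

Predicting offsets relative to an anchor keeps the regression targets small and roughly scale-independent, which is one common motivation for this design.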

| Step | Formula |
|---|---|
| Intersection x1 | max(A.x1, B.x1) |
| Intersection y1 | max(A.y1, B.y1) |
| Intersection x2 | min(A.x2, B.x2) |
| Intersection y2 | min(A.y2, B.y2) |
| Intersection area | max(0, x2 - x1) * max(0, y2 - y1) (0 when the boxes do not overlap) |
| Union area | area(A) + area(B) - intersection area |
| IoU | intersection area / union area |

Match threshold: typically 0.5 (loose); stricter benchmarks use 0.75 or average over a sweep of thresholds for mAP (COCO sweeps 0.50 to 0.95 in steps of 0.05).

| Box | Coords | Area |
|---|---|---|
| A (predicted) | (0, 0, 10, 10) | 100 |
| B (ground truth) | (1, 1, 11, 11) | 100 |
| Intersection | (1, 1, 10, 10) | 81 |
| Union | | 100 + 100 - 81 = 119 |
| IoU | | 81 / 119 ≈ 0.681 (match at 0.5 threshold) |
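The IoU steps and the worked example above can be sketched as a short function. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with `x1 < x2` and `y1 < y2`; the function name `iou` and the zero-union guard are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(a[0], b[0])
    iy1 = max(a[1], b[1])
    ix2 = min(a[2], b[2])
    iy2 = min(a[3], b[3])

    # Clamp each side at zero so non-overlapping boxes give zero area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate zero-area boxes.
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
ground_truth = (1, 1, 11, 11)

score = iou(predicted, ground_truth)  # 81 / 119, about 0.681
is_match = score >= 0.5               # True at the loose 0.5 threshold
```

Clamping width and height separately matters: multiplying two negative side lengths would otherwise produce a positive "intersection" for boxes that do not overlap at all.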

| Architecture | What it does | When to use |
|---|---|---|
| FCN (Long 2015) | Replaces the fully connected layers with convolutions so the output is a spatial map; semantic segmentation | First deep-learning approach; baseline |
| U-Net (Ronneberger 2015) | Encoder-decoder with skip connections from each encoder layer to its decoder counterpart | Medical imaging standard; clean spatial detail |
| Mask R-CNN (He 2017) | Faster R-CNN + per-region binary mask head; instance segmentation | When you need per-instance per-pixel masks |

| Technique | What it does | Cost |
|---|---|---|
| Saliency | Gradient of the class score with respect to the input pixels | Cheap (one backward pass) |
| Occlusion sensitivity | Slide a mask over the image, record where the prediction drops | Slow (many forward passes); often the most faithful, since it tests the model directly |
| Grad-CAM | Weights the last conv layer's feature maps by the gradient of the class score | Cheap; works on most CNNs without retraining; widely used in practice |
| Activation max / DeepDream | Gradient-ascend an input to maximize a neuron’s activation | Moderate; striking visuals |
| t-SNE / UMAP | Project deep features to 2D | Moderate; needs features from many images |

All of these are useful for intuition and debugging; none is a complete explanation. Explainable AI (XAI) is still active research.
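Occlusion sensitivity is the easiest of these to sketch without a deep-learning framework, because it only needs forward passes. The example below assumes a hypothetical `score_fn` that maps an image (a list of rows of floats) to a class score; `toy_score` stands in for a real model, and the patch size, stride, and fill value are arbitrary choices.

```python
def occlusion_map(image, score_fn, patch=2, stride=2, fill=0.0):
    """Return a grid of score drops, one per occluded patch position."""
    height, width = len(image), len(image[0])
    base = score_fn(image)  # score on the unmodified image

    heat = []
    for top in range(0, height - patch + 1, stride):
        row = []
        for left in range(0, width - patch + 1, stride):
            # Copy the image and blank out one patch.
            occluded = [r[:] for r in image]
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    occluded[y][x] = fill
            # A large drop means the model relied on this region.
            row.append(base - score_fn(occluded))
        heat.append(row)
    return heat


def toy_score(img):
    """Stand-in 'model': only looks at the top-left 2x2 corner."""
    return sum(img[y][x] for y in range(2) for x in range(2))


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_score)
# heat is [[4.0, 0.0], [0.0, 0.0]]: only the top-left patch matters.
```

The cost is one forward pass per patch position, which is why the table lists it as slow; the result also depends on the patch size and fill value, one reason it is not a complete explanation.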

| What | Why |
|---|---|
| Training loop | L3 loss + L4 backprop run over any architecture |
| Gradient descent step | Unchanged; just a new loss + new architecture per task |
| Backprop | Carries gradients through detection heads, segmentation heads, mask heads identically |

| Pitfall | Reality |
|---|---|
| IoU 0.5 = pass, IoU 0.49 = fail (sharp) | IoU is continuous; 0.5 is convention; visually-fine predictions can fall below |
| Semantic = instance segmentation | Semantic: per-pixel class only. Instance: + per-instance ID. Different output, different architecture |
| Saliency / Grad-CAM = ground truth on why | Local approximations; useful first looks, not proof |
| One-stage always faster than two-stage | Usually, but depends on backbone, image size, framework, hardware |

Detection adds “where” to “what” (list of class+box per image, IoU evaluation); segmentation refines “where” to per-pixel (semantic vs instance, U-Net / Mask R-CNN); visualization adds partial “why” (saliency / Grad-CAM / t-SNE) as a debugging tool, not a complete explanation.