References: Detection, segmentation, and visualization

Source material

This lesson follows Stanford CS231n’s coverage of object detection, image segmentation, and visualizing-and-understanding CNNs, all combined in CS231n’s Lecture 9.

Course: Stanford CS231n, “Deep Learning for Computer Vision”
Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
Course site: cs231n.stanford.edu
This lesson maps to: Lecture 9 (Object Detection, Image Segmentation, Visualizing and Understanding).

Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.

A note on access and license

The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.

Primary architecture papers (cited by name and venue)

Detection

R-CNN. Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation” (CVPR 2014). The original two-stage detector that brought CNNs to detection.
Fast R-CNN. Girshick, “Fast R-CNN” (ICCV 2015). Runs the CNN once per image and extracts per-region features with RoI pooling, a large speedup over R-CNN.
Faster R-CNN. Ren, He, Girshick, Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks” (NeurIPS 2015). End-to-end with a learned RPN; the canonical two-stage baseline.
YOLO. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection” (CVPR 2016 / arXiv 2015). The original one-stage detector.
SSD. Liu et al., “SSD: Single Shot MultiBox Detector” (ECCV 2016). One-stage with multi-scale feature maps.
RetinaNet. Lin et al., “Focal Loss for Dense Object Detection” (ICCV 2017). Introduced focal loss to address the foreground-background class imbalance that hurt earlier one-stage detectors.

Segmentation

FCN. Long, Shelhamer, Darrell, “Fully Convolutional Networks for Semantic Segmentation” (CVPR 2015). The first widely adopted end-to-end deep approach to per-pixel labelling.
U-Net. Ronneberger, Fischer, Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation” (MICCAI 2015). The encoder-decoder-with-skips architecture; medical-imaging standard.
Mask R-CNN. He, Gkioxari, Dollár, Girshick, “Mask R-CNN” (ICCV 2017). Best-paper award; the canonical instance-segmentation architecture.

Visualization

Saliency maps. Simonyan, Vedaldi, Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps” (ICLR 2014 workshop).
CAM (Class Activation Mapping). Zhou et al., “Learning Deep Features for Discriminative Localization” (CVPR 2016).
Grad-CAM. Selvaraju et al., “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization” (ICCV 2017). Generalizes CAM to networks without a global-average-pooling head; widely used for class localization in practice.
DeepDream. Mordvintsev, Olah, Tyka, “Inceptionism: Going Deeper into Neural Networks” (Google Research blog 2015). Activation maximization applied to whole images: gradient ascent on the input amplifies whatever a chosen layer already responds to.
t-SNE. van der Maaten and Hinton, “Visualizing Data using t-SNE” (JMLR 2008). The standard high-dimensional-feature visualization technique.
UMAP. McInnes, Healy, Melville, “UMAP: Uniform Manifold Approximation and Projection” (arXiv 2018). A more recent alternative to t-SNE that is typically faster on large datasets.

Further study

Detection production stacks. TorchVision’s detection module (PyTorch) and the open-source Detectron2 framework (Facebook AI Research) implement Faster R-CNN, Mask R-CNN, RetinaNet, and others with consistent APIs; start here if you want to run any of these models yourself.
Segmentation in medical imaging. The original U-Net paper, plus the nnU-Net framework (Isensee et al. 2018) for production medical-image segmentation pipelines.
Interpretability beyond saliency. Olah et al., “The Building Blocks of Interpretability” (Distill 2018) and the Anthropic / OpenAI mechanistic-interpretability papers from 2022 onward, for where the XAI field has gone past the techniques covered here.

How we use this source

Clawdemy follows the ordering of CS231n’s Lecture 9 (detection, then segmentation, then visualization) and surveys the canonical architectures at an intuitive level, citing the primary papers. The IoU worked examples are Clawdemy-authored against the standard IoU formula, with boxes given as (x1, y1, x2, y2) corners. In the body, predicted (0,0,10,10) vs ground-truth (1,1,11,11) gives IoU 81/119 ≈ 0.681, a match; in practice, (2,2,12,12) vs (5,5,15,15) gives IoU 49/151 ≈ 0.325, no match. The visualization-investigation exercise in practice (the wolf-vs-husky-style spurious-feature investigation plan) and the “honest caveat” framing on XAI are also Clawdemy-authored. We do not reproduce CS231n’s slides, figures, problem sets, or lecture text. Full attribution policy: see Doc/attribution-policy.md.
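
The two IoU figures quoted above can be checked with a few lines of Python. This is a minimal sketch, not the lesson’s own code: it assumes boxes are (x1, y1, x2, y2) corner coordinates, which is the reading under which the quoted fractions come out, and it assumes a match threshold of 0.5, a common convention that is not stated on this page. The names iou and MATCH_THRESHOLD are illustrative.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) corners, with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap rectangle: innermost left/top edges and innermost right/bottom edges.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give zero overlap, not a negative area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


MATCH_THRESHOLD = 0.5  # assumed convention; not taken from this page

body = iou((0, 0, 10, 10), (1, 1, 11, 11))      # 81 / 119
practice = iou((2, 2, 12, 12), (5, 5, 15, 15))  # 49 / 151

print(f"body: {body:.3f}, match={body >= MATCH_THRESHOLD}")              # body: 0.681, match=True
print(f"practice: {practice:.3f}, match={practice >= MATCH_THRESHOLD}")  # practice: 0.325, match=False
```

In the first pair the boxes overlap in a 9 × 9 square (81) and the union is 100 + 100 − 81 = 119; in the second the overlap is 7 × 7 (49) and the union is 200 − 49 = 151.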