Practice: The human-centered view

Self-check

Seven short questions. Answer each before opening the collapsible.

1. What is the scope of this lesson, and what is explicitly out of scope?

Show answer

In scope: the engineering side of vision systems’ real-world behaviour. How failures arise mechanically, how bias is a property of training-data composition + architecture + evaluation, how to measure those properties, what design and process choices reduce them. Out of scope: policy debates around what to permit, regulate, or restrict; ethical-theory disputes; “should this technology exist” questions. Those questions are real and important; they belong in their own forums with the right stakeholders (legal, ethics, regulatory). The engineering view sharpens those debates with measurable inputs; it does not replace them.

2. Name five vision-system failure modes from the engineering catalog.

Show answer

(1) Distribution shift (accuracy degrades on data unlike the training data). (2) Adversarial examples (tiny perturbations flip predictions). (3) Out-of-distribution inputs (model is confidently wrong on unfamiliar input). (4) Shortcut learning (model latches onto a spurious correlation, e.g., telling wolves from huskies by the snow in the background). (5) Calibration / overconfidence (confidence scores misaligned with actual accuracy).

3. Where does bias come from mechanically?

Show answer

A model fits the statistical structure of its training data. Web-scraped image datasets reflect the demographics, geographies, and contexts of the web, which is not uniform. The model inherits the data’s skews. The Gender Shades audit (Buolamwini and Gebru 2018) measured the accuracy of commercial gender-classification systems disaggregated by skin type and gender and found error rates of up to roughly 35 percent for darker-skinned women against under 1 percent for lighter-skinned men; the mechanical explanation was straightforward training-set skew. Different training data, different bias profiles.

4. What is the first engineering step toward addressing bias?

Show answer

Measurement, via disaggregated (sub-group) evaluation. Instead of reporting one overall test accuracy, report accuracy per demographic / geographic / contextual sub-population. A model with 95 percent overall but 99 percent on one group and 60 percent on another is hiding important behaviour behind an aggregate. Without measurement, mitigation is guessing.

5. Name three engineering categories of bias mitigation.

Show answer

(1) Data-side: curate balanced training sets (Inclusive Images, FairFace); targeted collection for underrepresented sub-groups; datasheets for datasets (Gebru et al. 2018) to standardize documentation of composition and intended use. (2) Model-side: adversarial debiasing (train the model so that a separate demographic predictor cannot recover the protected attribute from its representations); reweighting losses by sub-group; multi-task training with fairness-aware auxiliaries. (3) Evaluation-side: disaggregated reporting; stress-test sets that probe known weak sub-groups; pre-deployment audits.
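As an illustration of the "reweighting losses by sub-group" item, here is a minimal sketch using inverse-frequency weights. This is one common weighting scheme, not necessarily the one the lesson has in mind; the function names and the toy batch are hypothetical.

```python
from collections import Counter

def subgroup_weights(groups):
    """One loss weight per example, inversely proportional to the size
    of the example's sub-group and normalized so the mean weight is 1."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    # Every group ends up with the same total weight, n / k.
    return [n / (k * counts[g]) for g in groups]

def weighted_mean_loss(losses, weights):
    """Weighted average of per-example losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

# Hypothetical toy batch: group "a" has three examples, group "b" has one.
groups = ["a", "a", "a", "b"]
losses = [0.1, 0.1, 0.1, 0.9]
weights = subgroup_weights(groups)
print(weights)
print(weighted_mean_loss(losses, weights))
```

With these weights the single high-loss example from the small group counts as much as the whole majority group, so the optimizer can no longer lower the loss by ignoring it.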

6. What is the calibration question, and why does it matter for trustworthy deployment?

Show answer

Calibration: do the model’s confidence scores correspond to its actual accuracy? A model that says “95 percent confident” is well-calibrated if it is right 95 percent of the time when it makes that claim. Modern deep networks are often poorly calibrated by default. It matters because a calibrated system can defer uncertain cases (to a human operator, or refuse to act); a poorly-calibrated system acts with the same false certainty on its failure cases as on its successes. Engineering responses: temperature scaling, isotonic regression, deep ensembles.
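The "right 95 percent of the time when it says 95 percent" definition can be checked directly by binning predictions by confidence and comparing each bin's mean confidence with its accuracy. The sketch below summarizes that comparison as expected calibration error (ECE), a common summary metric that this lesson does not name; the bin count and the toy data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

# Hypothetical overconfident model: claims 0.95 but is right 6 times in 10.
overconfident = expected_calibration_error([0.95] * 10, [True] * 6 + [False] * 4)
# Hypothetical calibrated model: claims 0.95 and is right 19 times in 20.
calibrated = expected_calibration_error([0.95] * 20, [True] * 19 + [False])
print(overconfident, calibrated)
```

A value near zero means stated confidence tracks observed accuracy; a large value means the confidence scores should not be used for deferral decisions until the model is recalibrated.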

7. What is the trustworthiness gap, and what closes it?

Show answer

The trustworthiness gap is the difference between benchmark accuracy (on a fixed test set drawn from the training distribution) and real-world reliability (in production, where data drifts, edge cases appear, and sub-group disparities surface). It is closed by: monitoring in production (sub-group accuracy over time, OOD-input rate, confidence-distribution drift); calibration so the system knows when to defer; explicit human-in-the-loop design when automation cannot be fully trusted; pre-deployment stress-test evaluation that goes beyond the held-out test set.

Try it yourself: diagnose failures, design measurement, plan mitigation

Three exercises, about 15 minutes.

Part A: failure-mode diagnosis. For each described real-world failure, name the failure mode (distribution shift, adversarial examples, OOD inputs, shortcut learning, or calibration / overconfidence). Each scenario has one primary failure mode.

1. A self-driving model trained on sunny California weather degrades sharply when deployed in heavy snow on the same roads.
2. A radiologist deploys a tumor-detection model; she observes that it consistently and confidently misclassifies an unusual benign abnormality the training set did not include, instead of returning “I’m not sure.”
3. A wildlife-camera classifier “detects wolves” with 95 percent accuracy on the test set; researchers later find that 90 percent of the wolf images in training also had snow, and the model is essentially detecting snow.
4. A face-detection model has 95 percent accuracy on a balanced test set; in production, users find that a specific photo of a real face, with a small carefully-designed perturbation invisible to the human eye, is misclassified as a chair.

Answers

1. Distribution shift. The training distribution (sunny California) differs from the production distribution (snowy roads). This is the standard production failure; engineering responses include broader / more representative training data, domain adaptation on a small target-domain sample, or domain randomization.
2. Out-of-distribution (OOD) input. The unusual abnormality is genuinely outside the training distribution. The system needed to recognize its own uncertainty (calibrated confidence + OOD detection), not classify confidently. The right engineering response is an OOD detection head and human-in-the-loop escalation for low-confidence cases.
3. Shortcut learning. The model is using a spurious correlation (snow) instead of the actual signal (animal morphology). Diagnosis with saliency or Grad-CAM (lesson 8) would show the model attending to the background rather than the wolf. Engineering responses: dataset curation to decouple the shortcut (wolves without snow, other animals on snow), augmentation to disrupt the correlation, evaluation on splits that test for shortcut-reliance.
4. Adversarial example. A carefully-crafted small perturbation flips a confident prediction. Engineering responses: adversarial training, certified-robustness methods, input-validation pipelines.

Part B: disaggregated evaluation arithmetic. A face-detection model is evaluated on 1,000 test images, balanced 250 per group across four demographic sub-groups (A, B, C, D). It correctly classifies 247/250 in group A, 245/250 in group B, 180/250 in group C, and 248/250 in group D. (1) What is the aggregate accuracy? (2) What is the per-group accuracy for each? (3) Does the aggregate number tell the story?

Worked answer

(1) Aggregate accuracy. Total correct: 247 + 245 + 180 + 248 = 920. Total: 1,000. Aggregate accuracy = 92.0 percent.

(2) Per-group accuracy.

Group A: 247/250 = 98.8 percent
Group B: 245/250 = 98.0 percent
Group C: 180/250 = 72.0 percent
Group D: 248/250 = 99.2 percent

(3) Does the aggregate tell the story?

No. The 92 percent aggregate masks a substantial sub-group disparity: three of four groups are at 98 percent accuracy or above, but group C is at 72 percent, roughly 26-27 percentage points below the others. Group C’s error rate is 70/250 = 28 percent against 2/250 = 0.8 percent for group D, so a user from group C is 35x more likely to be misclassified by this model than a user from group D. The model’s behaviour is highly inconsistent across sub-groups, and a single aggregate number hides exactly the kind of disparity that surfaces as a real-world failure. Disaggregated reporting is what makes the disparity visible; the aggregate alone, even when reasonably high, gives no information about whether the model treats all users comparably.

This is the engineering motivation for disaggregated reporting becoming a standard practice. It is also a typical Gender-Shades-style finding (one group’s accuracy substantially lower than others) and the kind of result that triggers data-side mitigation (more training data from group C, possibly augmentation, possibly model-side debiasing) before redeployment.
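The Part B arithmetic can be reproduced in a few lines. This is a minimal sketch: the counts are the ones given in the exercise, while the function names and the dictionary layout are illustrative choices.

```python
# Correct / total counts per sub-group, taken from Part B.
results = {"A": (247, 250), "B": (245, 250), "C": (180, 250), "D": (248, 250)}

def aggregate_accuracy(results):
    """Accuracy over all examples, ignoring group membership."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total

def per_group_accuracy(results):
    """Accuracy computed separately for each sub-group."""
    return {group: c / t for group, (c, t) in results.items()}

def worst_gap(results):
    """Difference between the best and worst per-group accuracy."""
    acc = per_group_accuracy(results)
    return max(acc.values()) - min(acc.values())

print(f"aggregate: {aggregate_accuracy(results):.1%}")
for group, acc in per_group_accuracy(results).items():
    print(f"group {group}: {acc:.1%}")
print(f"largest gap: {worst_gap(results) * 100:.1f} points")
```

Reporting the largest gap next to the aggregate is one simple way to keep a disparity like group C's from disappearing into a single headline number.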

Part C: deployment plan. You are deploying a vision system to classify product photos at scale for an e-commerce platform. Your validation accuracy is 96 percent on a held-out test set. In 4-6 sentences, sketch a deployment plan that addresses the trustworthiness gap. Include at least: pre-deployment evaluation beyond the held-out set, what to monitor in production, how to handle uncertain predictions, and a failure-mode plan.

What a good answer looks like

Pre-deployment. Beyond the held-out test set, run disaggregated evaluation (per product category, per geographic source, per image-quality bucket) to find any sub-group disparities the aggregate hides. Run distribution-shift stress tests by evaluating on images from time periods or sellers not in training, to estimate degradation under realistic drift. Calibrate the model’s confidence scores (temperature scaling on a held-out set) so confidence numbers actually correspond to accuracy.
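A minimal sketch of the temperature-scaling step mentioned above: a single scalar T divides the logits, and T is chosen to minimize negative log-likelihood on held-out data. Practical implementations usually fit T with a gradient-based optimizer; the grid search, the candidate range, and the toy logits here are simplifying assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)

def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest held-out NLL."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(46)]  # 0.5 up to 5.0
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))

# Hypothetical held-out logits from an overconfident 2-class model:
# every margin is large, yet one prediction in four is wrong.
held_out_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
held_out_labels = [0, 0, 0, 1]
T = fit_temperature(held_out_logits, held_out_labels)
print(T, softmax([4.0, 0.0], T)[0])
```

Dividing by T does not change which class has the highest score, so accuracy is unaffected; only the confidence attached to each prediction moves, here from about 0.98 toward the observed 0.75.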

Production monitoring. Log per-category accuracy over time on a sample of human-labeled production traffic; alert when sub-group performance drops or when the distribution of input images shifts noticeably (changing seller mix, new product categories, etc.). Track the rate of low-confidence predictions; a rising rate is an early signal that production data is moving away from training data and that more inputs are out-of-distribution.
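One of the monitoring signals above, the rate of low-confidence predictions, can be sketched as a comparison between a recent window and a baseline measured at launch. The threshold, tolerance, and sample values are hypothetical and would be tuned per deployment.

```python
def low_confidence_rate(confidences, threshold):
    """Fraction of predictions whose confidence falls below the threshold."""
    return sum(c < threshold for c in confidences) / len(confidences)

def drift_alert(baseline_rate, window_confidences, threshold, tolerance=0.05):
    """Return (alert, rate): alert is True when the window's low-confidence
    rate exceeds the baseline by more than the tolerance."""
    rate = low_confidence_rate(window_confidences, threshold)
    return rate - baseline_rate > tolerance, rate

# Hypothetical numbers: 10 percent low-confidence at launch, then a
# recent window in which half the predictions are low-confidence.
recent = [0.9, 0.4, 0.3, 0.95, 0.2, 0.85, 0.45, 0.99, 0.3, 0.97]
alert, rate = drift_alert(baseline_rate=0.10, window_confidences=recent, threshold=0.5)
print(alert, rate)
```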

Handling uncertain predictions. Route low-confidence predictions to human review rather than letting the model commit them at scale. The calibrated confidence score is what makes this routing possible. Set thresholds based on the calibrated accuracy you actually achieve, not the model’s raw output.
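The routing rule can be sketched as follows: choose the threshold from held-out, human-labeled data so that the predictions accepted automatically meet a target accuracy, then send everything below it to review. The target, the sample data, and the function names are assumptions for illustration.

```python
def pick_threshold(confidences, correct, target_accuracy):
    """Lowest confidence threshold whose accepted predictions meet the
    target accuracy on held-out data; None if no threshold does."""
    for threshold in sorted(set(confidences)):
        accepted = [ok for c, ok in zip(confidences, correct) if c >= threshold]
        if sum(accepted) / len(accepted) >= target_accuracy:
            return threshold
    return None

def route(confidence, threshold):
    """Auto-accept confident predictions; send the rest to human review."""
    return "auto" if confidence >= threshold else "human_review"

# Hypothetical held-out sample with human-verified correctness.
confidences = [0.99, 0.97, 0.95, 0.90, 0.80, 0.70, 0.60, 0.55]
correct = [True, True, True, True, False, True, False, False]
threshold = pick_threshold(confidences, correct, target_accuracy=0.95)
print(threshold, route(0.92, threshold), route(0.75, threshold))
```

If `pick_threshold` returns `None`, no operating point meets the target and all traffic should go to review until the model improves. On a sample this small the chosen threshold is noisy; a real deployment would estimate it from far more labeled traffic.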

Failure-mode plan. Have an explicit rollback procedure (revert to previous model on detected degradation), a graceful-degradation path (fall back to simpler heuristics when the model is unavailable or untrusted), and an escalation chain for novel failure patterns (which engineer responds, what they can do).

The deeper point: deployment is not “ship the trained model”; it is the trained model plus the monitoring, plus the calibration, plus the human-review queue, plus the rollback plan. The trustworthiness gap is closed by engineering at all these layers, not by a better single model.

Flashcards

Nine cards.

Q. What is this lesson's scope, and what is out of scope?

In scope: engineering side of vision systems’ real-world behaviour (failure modes, bias as training-data property, measurement and mitigation as design choices). Out of scope: policy/regulatory/ethical-theory debates; “should this technology exist” questions. Those belong with the right stakeholders.

Q. Five engineering failure modes for vision systems?

Distribution shift, adversarial examples, out-of-distribution inputs, shortcut learning, calibration / overconfidence. Each has named engineering responses (domain adaptation, adversarial training, OOD detection, dataset curation, calibration methods).

Q. Where does bias come from mechanically?

A model fits the statistical structure of its training data. Web-scraped datasets reflect uneven demographics, geographies, contexts. Model inherits the skews. Gender Shades 2018 audit measured gender-classification accuracy disaggregated by skin type + gender; the gap traced to training-set skew. Different data, different bias profiles.

Q. First engineering step toward addressing bias?

Measurement via disaggregated (sub-group) evaluation. Aggregate accuracy hides important behaviour; per-group accuracy reveals it. Without measurement, mitigation is guessing.

Q. Three engineering categories of bias mitigation?

(1) Data: balanced curation, targeted collection, datasheets for datasets. (2) Model: adversarial debiasing, loss reweighting, fairness-aware auxiliaries. (3) Evaluation: disaggregated reporting, stress-test sets, pre-deployment audits.

Q. What is calibration, and why does it matter for deployment?

Calibration: confidence scores correspond to actual accuracy. Well-calibrated systems know when to defer (human escalation, refuse to act); poorly-calibrated systems act with false certainty on failures. Methods: temperature scaling, isotonic regression, deep ensembles.

Q. The trustworthiness gap, in one sentence?

The gap between benchmark accuracy (on held-out test set) and real-world reliability (in production, with drift, edge cases, sub-group disparities). Closed by monitoring + calibration + human-in-the-loop + explicit failure-mode plans.

Q. Why is high test-set accuracy not a guarantee?

Distribution shift, shortcut learning, calibration issues, and tail-event behaviour all live in the gap between the test set and the real world. 95 percent test accuracy says the model can do well on the test set; it says nothing direct about deployment reliability.

Q. What does deployment include beyond the trained model?

Trained model + data pipeline + evaluation suite + monitoring dashboard + calibration + human-review queue + failure-mode plan (rollback, graceful degradation, escalation). The trustworthiness gap is closed at all these layers, not by one better model.