| Property | Internet text/image | Biology |
|---|---|---|
| Modalities per dataset | one or two | many (molecules, microscopy, transcriptomics, proteomics, phenotype, clinical text) |
| Scale per modality | billions to trillions | thousands to millions |
| Cost per example | near-zero | expensive (wet-lab time) |
| Noise | low | high; replication adds cost |
| Ground truth for the target | abundant | scarce (clinical outcomes are precisely what you want to predict) |

| Step | Action |
|---|---|
| 1 | encode each biological modality into a shared embedding |
| 2 | train multimodal transformer on co-occurring data (molecule + cell + outcome) |
| 3 | predict perturbation effects for new molecules or new cell systems |
| Connection to L7 | predict semantic biological state, not raw outputs (the capacity-on-semantic-structure argument applies even more strongly here) |
| Public example | Noetik.ai’s OCTO and Perturb-map |
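
Step 1 of the table above can be sketched as follows. This is a minimal illustration, not the method of any system named here: the modality names, feature dimensions, random linear encoders, and mean-pooling fusion are all assumptions. In practice each encoder would be a learned network, and fusion would be done by the multimodal transformer of step 2.

```python
import math
import random

SHARED_DIM = 4  # assumed size of the shared embedding space


def make_encoder(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned encoder: a fixed random linear map."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(features):
        if len(features) != in_dim:
            raise ValueError(f"expected {in_dim} features, got {len(features)}")
        projected = [sum(w * x for w, x in zip(row, features)) for row in weights]
        norm = math.sqrt(sum(v * v for v in projected)) or 1.0
        # Unit-normalize so embeddings from different modalities are comparable.
        return [v / norm for v in projected]

    return encode


def fuse(embeddings):
    """Average per-modality embeddings into one joint vector (assumed fusion rule)."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


rng = random.Random(0)
encoders = {
    "molecule": make_encoder(6, SHARED_DIM, rng),          # e.g. a 6-bit fingerprint
    "transcriptomics": make_encoder(10, SHARED_DIM, rng),  # e.g. 10 expression values
}

# One made-up sample observed in two modalities.
sample = {
    "molecule": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    "transcriptomics": [0.2, 1.5, 0.0, 3.1, 0.7, 0.0, 2.2, 0.4, 1.1, 0.9],
}

per_modality = {name: encoders[name](values) for name, values in sample.items()}
joint = fuse(list(per_modality.values()))
```

The point of the sketch is only the shape of the computation: inputs of different sizes end up as vectors of one common size, which is what lets a single downstream model consume them together.
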

| Claim type | Example | Instruments to settle |
|---|---|---|
| ML benchmark | “91% AUC on held-out cell-line response” | training/validation/test, AUC/F1/correlation |
| Clinical | “30% improved patient survival vs standard-of-care” | randomized trials, clinical endpoints, regulatory review |

These are not the same claim, and they do not sit on the same epistemic ladder.

The conflation to avoid: “Model passes ML benchmark → therefore the drug it identified will work in patients.”

This is the single most important pitfall in medical AI. Benchmark performance does not establish clinical utility; treating a benchmark result as evidence of clinical benefit is the standard medical-AI overreach.
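
To make concrete what a benchmark number such as an AUC actually measures, here is a minimal sketch of the metric in plain Python. The labels and scores are made-up illustrative values, not results from any model; the function uses the pairwise (Mann-Whitney) formulation, in which AUC is the fraction of positive/negative pairs the model ranks correctly.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a correct ranking (Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out cell-line responses (1 = responder) and model scores.
held_out_labels = [1, 1, 1, 0, 0, 0, 1, 0]
held_out_scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.2, 0.7, 0.1]

auc = roc_auc(held_out_labels, held_out_scores)
print(f"AUC = {auc:.4f}")  # 15 of 16 pairs ranked correctly -> 0.9375
```

Every term in this computation is a ranking of held-out cell-line labels. Nothing in it refers to a patient, a dose, or a clinical endpoint, which is why a high value cannot by itself support a clinical claim.
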

| If the question is settled by… | It is… |
|---|---|
| ML benchmark / training loss / representation quality / generalization tests | IN SCOPE (technique) |
| Clinical trial / regulatory review / standard-of-care framework / patient consent process / clinical-practice judgment | OUT OF SCOPE (different conversation) |

| Category | Instruments |
|---|---|
| Diagnostic claims / clinical validity | clinical trials, gold-standard comparisons |
| Regulatory framework | FDA, EMA, sectoral medical regulators |
| Medical malpractice / standard-of-care | legal precedent, professional medical societies |
| Patient consent for AI involvement | bioethics, patient-advocacy frameworks |
| Clinical-trial methodology vs ML-evaluation methodology | translational science |
| Therapeutic claims (what to prescribe) | clinical-practice judgment, evidence-based medicine |

| Topic | Why |
|---|---|
| Model architecture (multimodal transformer fusing modalities) | technique |
| Training methodology (multimodal pretraining, contrastive losses, world-model objectives) | technique |
| Benchmark performance (per-task biological metrics) | evaluation |
| Representation quality (transfer to downstream tasks; latent organization) | evaluation |
| Generalization (cross-cell-line, cross-perturbation) | evaluation |
| Compute and data requirements | engineering |
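
The cross-cell-line generalization row above implies a specific evaluation protocol: every record from the held-out cell line must be absent from training. A minimal sketch of such a leave-one-group-out split follows; the record fields and the cell-line and drug names are placeholders invented for illustration.

```python
from collections import defaultdict


def leave_one_group_out(records, group_key):
    """Yield (held_out_group, train, test) with no group shared by train and test."""
    by_group = defaultdict(list)
    for record in records:
        by_group[record[group_key]].append(record)
    for held_out in sorted(by_group):
        test = by_group[held_out]
        train = [r for g, rows in by_group.items() if g != held_out for r in rows]
        yield held_out, train, test


# Hypothetical perturbation records; all names and values are placeholders.
records = [
    {"cell_line": "A", "drug": "d1", "response": 1},
    {"cell_line": "A", "drug": "d2", "response": 0},
    {"cell_line": "B", "drug": "d1", "response": 1},
    {"cell_line": "B", "drug": "d2", "response": 1},
    {"cell_line": "C", "drug": "d1", "response": 0},
]

splits = list(leave_one_group_out(records, "cell_line"))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Passing `"drug"` as `group_key` instead gives the cross-perturbation variant. A random row-level split would mix cell lines across train and test and would tend to overstate generalization.
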

| Pitfall | Reality |
|---|---|
| “Model passes ML benchmark → clinically useful” | the gap is huge; closing it requires clinical trials, not better benchmarks |
| “Bigger model fixes drug discovery” | biology is data-limited, not capacity-limited; a bigger model tends to overfit more |
| “Multimodal models handle all biological data uniformly” | heterogeneity is real; specialized representations are sometimes needed |
| “World models replace experimentation” | they guide experiments; they do not replace them |