Practice: Multimodal world models for science

Self-check

Seven short questions. Try to answer each one before reading its answer.

1. Why doesn’t the internet-scale data story (billions of images, trillions of tokens) transfer cleanly to drug discovery?

Show answer

Biology has many modalities (molecular structures, cell imaging, transcriptomics, proteomics, phenotypic outcomes, clinical reports), but each is much smaller than internet-scale corpora, costly to produce, and noisy. Worse, the ground truth you most want to predict is the thing you have the least data for. The bottleneck is data the world has not yet produced, not capacity you can scale.

2. What is the multimodal world model framing for drug discovery, in three steps?

Show answer

(1) Encode each modality (molecules, microscopy, gene expression, etc.) into a shared embedding space. (2) Train a multimodal transformer on co-occurring biological data (“this molecule applied to this cell produced this gene-expression change and this phenotype”). (3) Predict perturbation outcomes for new molecules or new cell systems.
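
The three steps above can be sketched in a few lines of NumPy. This is a toy illustration, not the lesson's architecture: the modality names, feature sizes, and `SHARED_DIM` are hypothetical, mean-pooling stands in for the multimodal transformer, and the weights are random because step 2 (training on co-occurring data) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

SHARED_DIM = 16  # hypothetical size of the shared embedding space

# Hypothetical raw feature sizes per modality (illustrative only).
MODALITY_DIMS = {"molecule": 32, "microscopy": 64, "expression": 128}

# Step 1: one linear encoder per modality, projecting into the shared space.
encoders = {
    name: rng.normal(scale=dim ** -0.5, size=(dim, SHARED_DIM))
    for name, dim in MODALITY_DIMS.items()
}


def encode(sample):
    """Map each available modality to the shared space and stack the tokens."""
    return np.stack([sample[name] @ encoders[name] for name in sample])


def fuse(tokens):
    """Stand-in for the multimodal transformer: mean-pool the modality tokens."""
    return tokens.mean(axis=0)


# Step 3: a linear head that predicts an expression change from the fused
# representation. Untrained here; step 2 would fit encoders and head jointly.
head = rng.normal(
    scale=SHARED_DIM ** -0.5,
    size=(SHARED_DIM, MODALITY_DIMS["expression"]),
)


def predict_expression_change(sample):
    return fuse(encode(sample)) @ head


# A new (molecule, microscopy) pair with made-up features.
sample = {
    "molecule": rng.normal(size=MODALITY_DIMS["molecule"]),
    "microscopy": rng.normal(size=MODALITY_DIMS["microscopy"]),
}
prediction = predict_expression_change(sample)
print(prediction.shape)
```

The point of the shared space is that every modality becomes a vector of the same length, so the downstream model can combine whichever modalities happen to be available for a given sample.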

3. What is the single most important pitfall this lesson names?

Show answer

The conflation “model passes ML benchmark, therefore drug is clinically useful.” Benchmark performance does not establish clinical utility; the gap between them is exactly what makes drug discovery slow even when in-silico predictions look good. Closing that gap requires clinical trials, not better benchmark scores.

4. State the operational scope test, applied to medical-AI questions.

Show answer

What instruments would you use to settle the question? If an ML benchmark, training-loss curve, or representation-quality study settles it, it is technique and in scope. If a clinical trial, regulatory review, malpractice framework, or patient-consent process is required to settle it, the question lives in a different conversation evaluated by different methods.

5. Name three medical-AI categories this lesson defers to other forums.

Show answer

Any three of: diagnostic claims and clinical validity (clinical trials), regulatory framework (FDA / EMA), medical malpractice and standard-of-care implications (legal frameworks), patient consent for AI involvement (bioethics), clinical-trial methodology vs ML-evaluation methodology (translational science), therapeutic claims (clinical-practice judgment).

6. Why does “bigger model” not solve drug discovery the way it solves text-and-image tasks?

Show answer

Drug discovery is data-limited, not capacity-limited. The internet-scale text and image story works because billions of examples are available; biology produces orders of magnitude less data per modality. A larger model trained on the same small dataset tends to overfit more, not less. The bottleneck is how much data the world has produced, not how many parameters you can add.
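
A toy way to see the capacity-versus-data point is polynomial regression on a tiny dataset. This is only an analogy: polynomial degree stands in for model size, and the sine signal, noise level, and sample counts are arbitrary assumptions, not biological data. Training error can only fall as the degree grows; held-out error typically stops improving and often gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_signal(x):
    return np.sin(2 * np.pi * x)


# A deliberately tiny, noisy training set and a larger clean evaluation grid.
x_train = np.linspace(0.0, 1.0, 10)
y_train = true_signal(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_signal(x_test)


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    return train_mse, test_mse


# Degree plays the role of model capacity; the dataset stays fixed.
for degree in (1, 3, 9):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree={degree}  train={train_mse:.4f}  test={test_mse:.4f}")
```

Comparing the two columns across degrees shows the pattern: extra capacity keeps lowering the training error even when there is no more information in the data for it to use.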

7. What is the right way to read “AI cures cancer” headlines, given this lesson?

Show answer

Apply the operational scope test. “Predicts cellular response with X accuracy on a benchmark” is an ML benchmark claim, evaluated by ML standards. “Improves outcomes in a randomized clinical trial of N patients” is a clinical claim, evaluated by clinical-trial standards. They are not the same claim and are not settled by the same instruments; conflating them is the standard medical-AI overreach.

Try it yourself: which claim is which?

For each pair of headlines about the same research, identify which is the ML benchmark claim (in scope for this lesson) and which is the clinical claim (out of scope, different instruments). Explain why in one line.

Pair 1:
A. "New multimodal model predicts cell-line response to oncology compounds
    with 91% area-under-curve on a held-out test set."
B. "AI cures cancer: new drug identified by AI shows tumor regression in
    early Phase II trial."

Pair 2:
C. "Multimodal world model achieves 0.85 Spearman correlation with measured
    gene-expression changes across 5 cell lines."
D. "Patients treated with AI-recommended therapy showed 30% improved
    survival vs standard-of-care in randomized trial."

Show answer

A: ML benchmark claim, IN SCOPE. Held-out test set + AUC are ML instruments. The lesson covers exactly this. Can be evaluated with training/validation/test methodology.
B: clinical claim, OUT OF SCOPE. Phase II trial + tumor regression are clinical instruments. Settlement requires patients, controls, and clinical-trial methodology that no benchmark provides.
C: ML benchmark claim, IN SCOPE. Spearman correlation against measured changes is an ML evaluation. The architectural and training-loss conversation is the right frame.
D: clinical claim, OUT OF SCOPE. Randomized trial + survival outcome are the gold-standard clinical instruments. Lives in clinical-trials and translational-science conversations.

The discriminating procedure throughout: if the headline cites ML metrics (AUC, accuracy, correlation, F1), it is making an ML-benchmark claim. If it cites patient outcomes, survival, trial enrollment, clinical endpoints, or regulatory milestones, it is making a clinical claim. Conflating the two (“the model passes its benchmark, therefore the drug works”) asks a benchmark result to support a clinical conclusion, a jump that the benchmark alone cannot make.
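
The discriminating procedure can be written down as a small keyword check. This is a rough sketch, not a reliable classifier: the term lists are illustrative, substring matching is crude, and giving clinical terms precedence when both kinds appear is an assumption made here.

```python
# Illustrative term lists drawn from the discriminating procedure;
# real headlines would need a far richer vocabulary.
ML_TERMS = (
    "auc", "area-under-curve", "accuracy", "correlation",
    "f1", "held-out", "test set",
)
CLINICAL_TERMS = (
    "patient", "survival", "trial", "phase i",
    "clinical endpoint", "regulatory", "fda",
)


def classify_claim(headline):
    """Label a headline as 'clinical', 'ml_benchmark', or 'unclear'."""
    text = headline.lower()
    # Assumption: any clinical term makes the headline a clinical claim,
    # even if it also mentions an ML metric.
    if any(term in text for term in CLINICAL_TERMS):
        return "clinical"
    if any(term in text for term in ML_TERMS):
        return "ml_benchmark"
    return "unclear"


# The four headlines from the exercise, joined onto single lines.
HEADLINES = {
    "A": "New multimodal model predicts cell-line response to oncology "
         "compounds with 91% area-under-curve on a held-out test set.",
    "B": "AI cures cancer: new drug identified by AI shows tumor regression "
         "in early Phase II trial.",
    "C": "Multimodal world model achieves 0.85 Spearman correlation with "
         "measured gene-expression changes across 5 cell lines.",
    "D": "Patients treated with AI-recommended therapy showed 30% improved "
         "survival vs standard-of-care in randomized trial.",
}

for label, headline in HEADLINES.items():
    print(label, classify_claim(headline))
```

On the four exercise headlines this labels A and C as `ml_benchmark` and B and D as `clinical`, matching the answers given.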

Try it yourself: apply the medical-AI operational scope test

For each question about a multimodal world model for drug discovery, label it IN SCOPE (technique / evaluation) or OUT OF SCOPE (different conversation), and name the relevant instrument that would settle it.

A. How does the architecture's cross-modal attention compare to a
   modality-specific baseline on a held-out cell-line dataset?
B. Should physicians prescribe a drug whose mechanism was first identified
   by this multimodal model?
C. What is the model's generalization gap between in-distribution cell
   lines and out-of-distribution ones?
D. Does the FDA require additional safety review when AI-derived
   candidates enter clinical trials?
E. How does the model's representation-quality score compare to that of a
   model trained with a JEPA-style objective on the same dataset?
F. What is the malpractice exposure for a clinician who relies on this
   model's prediction?

Show answer

A: IN SCOPE. Instrument: held-out benchmark evaluation. Direct architecture comparison; this lesson’s primary territory.
B: OUT OF SCOPE. Instrument: clinical practice judgment + clinical evidence + medical ethics. Therapeutic-decision question; not an ML question.
C: IN SCOPE. Instrument: generalization gap analysis on out-of-distribution cell lines. Standard ML evaluation; lesson territory.
D: OUT OF SCOPE. Instrument: regulatory framework (FDA review pathways). Regulatory question, not technical.
E: IN SCOPE. Instrument: comparative benchmark performance + representation-quality metrics. Methodological-comparison question; in scope.
F: OUT OF SCOPE. Instrument: legal precedent + standard-of-care frameworks. Malpractice / liability question, different field entirely.

The pattern: A, C, and E are settled by ML evaluation; B, D, and F require entirely different instruments (clinical, regulatory, legal). The operational test cleanly separates the two, so there is no need to memorize the category list.
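
As a concrete example of an in-scope instrument, the generalization gap in question C reduces to a simple difference of scores. A minimal sketch, assuming a higher-is-better metric per cell line; the function name and all numbers are hypothetical, not results from any model.

```python
def generalization_gap(in_dist_scores, out_dist_scores):
    """Mean in-distribution score minus mean out-of-distribution score."""
    in_mean = sum(in_dist_scores) / len(in_dist_scores)
    out_mean = sum(out_dist_scores) / len(out_dist_scores)
    return in_mean - out_mean


# Hypothetical per-cell-line scores (higher is better); not real results.
in_distribution = [0.90, 0.88, 0.92]
out_of_distribution = [0.70, 0.75, 0.65]

gap = generalization_gap(in_distribution, out_of_distribution)
print(f"generalization gap: {gap:.2f}")  # generalization gap: 0.20
```

Everything needed to compute this number lives in held-out evaluation data, which is what makes the question an ML one rather than a clinical, regulatory, or legal one.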

Flashcards

Ten cards.

Q. Why is biology harder data than the internet?

Biology has many modalities, each much smaller than internet-scale corpora, expensive to produce, and noisy, and the ground truth you most want to predict (clinical effect) is what you have the least data for. Scaling-law stories from text and image generation do not transfer cleanly.

Q. What is the multimodal world model framing for drug discovery?

Encode each biological modality into a shared embedding, train a multimodal transformer on co-occurring biological data (molecule + cell + outcome), predict perturbation effects for new molecules and cell systems.

Q. Name three biological data modalities a multimodal drug-discovery model fuses.

Any three of: molecular structures, cell microscopy, transcriptomics (gene expression), proteomics, phenotypic outcomes (cell behavior), clinical reports.

Q. What is the single most important pitfall this lesson names?

“Model passes ML benchmark, therefore drug is clinically useful.” Benchmark performance and clinical utility are evaluated by different instruments; the gap between them is exactly what makes drug discovery slow.

Q. State the operational scope test for medical-AI claims.

What instruments settle the question? If ML benchmarks settle it, it is in scope (technique). If clinical trials, regulatory review, malpractice frameworks, or patient-consent processes are required, it belongs to a different conversation.

Q. Why doesn't 'bigger model' fix drug discovery?

Biology is data-limited, not capacity-limited. A larger model trained on the same small dataset tends to overfit more, not less. The bottleneck is the data the world has produced, not the parameters you can throw at it.

Q. Do AI-based world models replace experimentation in drug discovery?

No. They guide which experiments to run (compound screening, targeted assays) and which not to. They make the wet-lab loop more efficient; they do not replace it.

Q. Name two medical-AI categories deferred to other forums.

Any two of: diagnostic claims and clinical validity (clinical trials), regulatory framework (FDA/EMA), medical malpractice (legal frameworks), patient consent (bioethics), clinical-trial methodology vs ML evaluation (translational science), therapeutic claims (clinical practice).

Q. What is the connection between this lesson and L7 (JEPA + world modeling)?

Same world-model framing: predict the semantic state of a system under intervention, not raw outputs. The capacity-on-semantic-structure argument from L7 matters even more in biology because data is so much more limited than internet-scale.

Q. How should you read an 'AI cures cancer' headline?

Apply the operational test. If the underlying result is benchmark performance, it is an ML claim and can be evaluated as one. If the underlying result is patient outcomes in a controlled trial, it is a clinical claim, judged by the much stricter standards of clinical trials.