Multimodal world models for science: cheatsheet

Why biology is harder data than the internet

Property	Internet text/image	Biology
Modalities per dataset	one or two	many (molecules, microscopy, transcriptomics, proteomics, phenotype, clinical text)
Scale per modality	billions to trillions	thousands to millions
Cost per example	near-zero	expensive (wet-lab time)
Noise	low	high; replication adds cost
Ground truth for the target	abundant	scarce (clinical outcomes are precisely what you want to predict)

The multimodal world model framing for drug discovery

Step	Action
1	encode each biological modality into a shared embedding
2	train multimodal transformer on co-occurring data (molecule + cell + outcome)
3	predict perturbation effects for new molecules or new cell systems
Connection to L7	predict semantic biological state, not raw outputs (the capacity-on-semantic-structure argument applies even more strongly here)
Public example	Noetik.ai’s OCTO and Perturb-map
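
Steps 1 and 2 above can be sketched in a few lines. This is a minimal illustration, not any particular system's method: the random linear maps stand in for learned encoders, the feature dimensions (16 and 32) and the batch of four paired examples are invented for the example, and the InfoNCE contrastive loss is one common way to align co-occurring modalities, assumed here rather than taken from the source.

```python
import math
import random

def linear_encoder(in_dim, out_dim, rng):
    """Random linear map (list of rows); a stand-in for a learned modality encoder."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def encode(weights, x):
    """Project x into the shared space and L2-normalize the result."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    norm = math.sqrt(sum(v * v for v in z)) or 1.0
    return [v / norm for v in z]

def info_nce(za, zb, temperature=0.1):
    """InfoNCE loss where za[i] and zb[i] are a matched pair; other rows are negatives."""
    losses = []
    for i, a in enumerate(za):
        logits = [sum(p * q for p, q in zip(a, b)) / temperature for b in zb]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

rng = random.Random(0)
SHARED_DIM = 8
mol_encoder = linear_encoder(16, SHARED_DIM, rng)   # hypothetical molecule features
cell_encoder = linear_encoder(32, SHARED_DIM, rng)  # hypothetical expression features

# Toy paired batch: row i of each modality comes from the same experiment.
molecules = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
cells = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(4)]

z_mol = [encode(mol_encoder, x) for x in molecules]
z_cell = [encode(cell_encoder, x) for x in cells]
loss = info_nce(z_mol, z_cell)
print(f"contrastive loss on untrained encoders: {loss:.3f}")
```

In a real pipeline the encoders would be trained to lower this loss, and a multimodal transformer would consume the shared-space embeddings; both are outside this sketch.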

The sharp benchmark-vs-clinical line

Claim type	Example	Instruments to settle
ML benchmark	“91% AUC on held-out cell-line response”	training/validation/test, AUC/F1/correlation
Clinical	“30% improved patient survival vs standard-of-care”	randomized trials, clinical endpoints, regulatory review

These are different claims, and they do not sit on the same epistemic ladder: no amount of benchmark evidence climbs up to a clinical claim.
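
To make the benchmark side concrete, here is a minimal sketch of what an AUC figure measures: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The labels and scores are made-up toy values, and the pairwise loop is chosen for readability rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy held-out set: 1 = cell line responded, 0 = did not (invented values).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The number is a ranking statistic over held-out examples. Nothing in its computation involves patients, endpoints, or a comparison arm, which is why it cannot settle the clinical row of the table.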

The named conflation pitfall

“Model passes ML benchmark → therefore the drug it identified will work in patients.”

This is the single most important pitfall in medical AI. Benchmark performance does not establish clinical utility; treating a benchmark result as evidence of clinical benefit is the standard medical-AI overreach.

Operational scope test (medical-AI specialization)

If the question is settled by…	It is…
ML benchmark / training loss / representation quality / generalization tests	IN SCOPE (technique)
Clinical trial / regulatory review / standard-of-care framework / patient consent process / clinical-practice judgment	OUT OF SCOPE (different conversation)
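
The scope test is a lookup on the instrument that would settle the question. A minimal sketch, assuming the instrument names from the table above as keys; the function name `scope_of` and the "UNCLASSIFIED" fallback are illustrative choices, not part of the original test.

```python
# Instruments copied from the scope-test table; the keys are lowercase for matching.
IN_SCOPE = {
    "ml benchmark",
    "training loss",
    "representation quality",
    "generalization tests",
}
OUT_OF_SCOPE = {
    "clinical trial",
    "regulatory review",
    "standard-of-care framework",
    "patient consent process",
    "clinical-practice judgment",
}

def scope_of(instrument):
    """Classify a question by the instrument that would settle it."""
    key = instrument.strip().lower()
    if key in IN_SCOPE:
        return "IN SCOPE (technique)"
    if key in OUT_OF_SCOPE:
        return "OUT OF SCOPE (different conversation)"
    return "UNCLASSIFIED"  # hypothetical fallback for instruments the table does not list

print(scope_of("ML benchmark"))
print(scope_of("Clinical trial"))
```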

Six categories OUT of scope (with instruments)

Category	Instruments
Diagnostic claims / clinical validity	clinical trials, gold-standard comparisons
Regulatory framework	FDA, EMA, sectoral medical regulators
Medical malpractice / standard-of-care	legal precedent, professional medical societies
Patient consent for AI involvement	bioethics, patient-advocacy frameworks
Clinical-trial methodology vs ML-evaluation methodology	translational science
Therapeutic claims (what to prescribe)	clinical-practice judgment, evidence-based medicine

What is IN scope

Topic	Why
Model architecture (multimodal transformer fusing modalities)	technique
Training methodology (multimodal pretraining, contrastive losses, world-model objectives)	technique
Benchmark performance (per-task biological metrics)	evaluation
Representation quality (transfer to downstream tasks; latent organization)	evaluation
Generalization (cross-cell-line, cross-perturbation)	evaluation
Compute and data requirements	engineering
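
The cross-cell-line generalization row depends on how the test set is built: every example from a held-out cell line must stay out of training. A minimal sketch of such a group-wise split follows; the record fields (`cell_line`, `compound`, `response`) and the toy data are hypothetical.

```python
import random

def split_by_group(records, group_key, test_fraction=0.25, seed=0):
    """Hold out whole groups (e.g. cell lines) so no group appears in both splits."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test

# Toy dataset: 12 measurements spread over 4 hypothetical cell lines.
records = [
    {"cell_line": f"line_{i % 4}", "compound": f"cmpd_{i}", "response": i % 2}
    for i in range(12)
]
train, test = split_by_group(records, "cell_line")
print(len(train), "train records;", len(test), "test records")
```

Splitting on `compound` instead would give the cross-perturbation variant. A random per-record split would leak each cell line into both sets and overstate generalization.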

Common pitfalls

Pitfall	Reality
“Model passes ML benchmark → clinically useful”	the gap is large; closing it requires clinical trials, not better benchmarks
“Bigger model fixes drug discovery”	biology is data-limited, not capacity-limited; a bigger model tends to overfit scarce data more
“Multimodal models handle all biological data uniformly”	heterogeneity is real; specialized representations are sometimes needed
“World models replace experimentation”	they guide experiments; they do not replace them