Multimodal world models for science

The previous lesson framed world models as systems that predict the future semantic state of an environment, and argued that for many planning-oriented tasks predicting embeddings beats predicting raw outputs. The “environment” in those framings was usually a physical one (a robot’s surroundings, a video scene). This lesson takes the same world-modeling framing into a very different environment: biological. The world becomes cells, molecules, pathways, and patient tissue. The “future state” becomes how a drug candidate will perturb that biological system.

Drug discovery is one of the most consequential application domains for multimodal AI, and Noetik.ai (the source for this lesson) is among several groups applying the multimodal world model framing to it specifically. This lesson covers the data-fusion challenge that distinguishes biology from internet-scale text and image data, and the world-model framing as a response to it. It is also explicit about a distinction that has tripped up the medical-AI literature repeatedly: “the model performs well on a biological benchmark” is not the same claim as “the model is medically useful.”

Why biology is harder data than the internet

Phases 2 and 3 of this track rode on a comforting assumption: the internet has billions of images, trillions of words, and millions of hours of video. When you have that much data, many model design problems get easier because scaling dominates. Biology does not enjoy that abundance, and it is worth being precise about why.

Many modalities, none individually abundant. Drug discovery data includes molecular structures, cell microscopy images, gene expression measurements (transcriptomics), protein measurements (proteomics), phenotypic outcomes (does the cell die, divide, change shape), and clinical reports. Each of these is its own modality with its own measurement apparatus and its own representational conventions.
Each modality is much smaller than internet-scale. A single high-quality cell-imaging experiment produces thousands or hundreds of thousands of images, not billions. A transcriptomics dataset measures thousands of genes across hundreds or thousands of conditions, not internet text’s trillions of tokens.
Cost and noise. Biological experiments are expensive and noisy. Replicating an experiment to reduce noise costs real laboratory time. Data quality varies enormously across experiments and across labs.
Ground truth is often what you are trying to predict. For “will this molecule cure cancer,” there is no labeled dataset; the label exists only after expensive trials. The thing you most want to predict is the thing you have least data for.

The result is that scaling laws of the kind that drive text and image generation do not transfer cleanly. You cannot solve a drug discovery problem by training a 1-trillion-parameter model on 100 trillion tokens of clinical text. The bottleneck is not capacity; it is data that the world has not yet produced.

The multimodal world model framing

What you can do is fuse the heterogeneous data biology does produce into a shared representation and learn to predict perturbation outcomes across that representation. The framing is the same multimodal pattern we saw in L3 (tokenize everything, put it through one transformer) and L7 (predict in embedding space), applied to biological data instead of text and pixels:

Encode each modality (molecular structure, microscopy image, gene expression vector, etc.) into a shared embedding space.
Train a multimodal transformer (or family of encoders + a shared backbone) on co-occurring biological data: “this molecule applied to this cell line produced this gene-expression change and this phenotypic outcome.”
The trained model can predict aspects of the biological system’s response to a new perturbation: “given this molecule structure, what gene-expression change is expected in this cell line?”
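The three steps above can be sketched in a few lines of NumPy. This is only an illustration of the data flow, not Noetik.ai's architecture: the modality names, feature sizes, linear "encoders", and the single-matrix "backbone" are all hypothetical stand-ins, the weights are random rather than trained, and the random arrays stand in for real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 16  # hypothetical width of the shared embedding space

# Hypothetical raw feature sizes per modality (illustrative only).
MODALITY_DIMS = {"molecule": 32, "microscopy": 64, "expression": 100}

# Step 1: one encoder per modality. A random linear map stands in for a
# real learned encoder (graph network, vision model, and so on).
encoders = {
    name: rng.normal(scale=dim ** -0.5, size=(dim, EMBED_DIM))
    for name, dim in MODALITY_DIMS.items()
}


def encode(name, x):
    """Project one modality's raw features into the shared space (unit norm)."""
    z = x @ encoders[name]
    return z / np.linalg.norm(z, axis=-1, keepdims=True)


def predict_response(mol_z, cell_z, w):
    """Toy backbone: map (molecule, cell state) embeddings to a predicted
    embedding of the post-perturbation state."""
    return np.concatenate([mol_z, cell_z], axis=-1) @ w


def embedding_loss(pred_z, target_z):
    """Mean squared error in embedding space, not on raw readouts."""
    return float(np.mean((pred_z - target_z) ** 2))


# Step 2: a batch of co-occurring records ("this molecule, applied to this
# cell, produced this expression change"). Random placeholders, not data.
batch = 4
mol = rng.normal(size=(batch, MODALITY_DIMS["molecule"]))
img = rng.normal(size=(batch, MODALITY_DIMS["microscopy"]))
expr = rng.normal(size=(batch, MODALITY_DIMS["expression"]))

mol_z = encode("molecule", mol)
img_z = encode("microscopy", img)
target_z = encode("expression", expr)

# Step 3: predict the response embedding and score it in embedding space.
w = rng.normal(scale=0.1, size=(2 * EMBED_DIM, EMBED_DIM))
pred_z = predict_response(mol_z, img_z, w)
loss = embedding_loss(pred_z, target_z)
print(pred_z.shape, loss)
```

In a real system the loss would drive gradient updates to the encoders and backbone; the point here is only that every modality lands in the same space and the prediction target is an embedding, not a raw image or count matrix.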

This is a multimodal world model in the L7 sense: it predicts the semantic state of a biological system under intervention, rather than rendering raw experimental outputs pixel by pixel. The argument is the same as in L7, and it matters even more in biology than it did in video: when data is precious, you cannot afford to waste model capacity on rendering surface details (exact image noise, exact small measurement variations) when what you need is to predict the relevant biological signal.

Noetik.ai’s public framing of their OCTO and Perturb-map work describes exactly this: multimodal models of patient biology, trained on diverse data streams, used to predict the effect of interventions and to identify biological targets. The technical architecture is multimodal transformers; the scientific bet is that the world-model framing gives them traction that biology has historically lacked.

What this approach can and cannot do, and the sharp line between them

Here is where this lesson differs most from generic multimodal-AI lessons, and where the discipline matters most. There is a distinction in medical AI that the literature has stumbled over repeatedly, and it deserves to be named clearly:

A model performing well on a biological benchmark is not the same thing as the model being medically useful.

The two claims are evaluated by entirely different instruments. The first is settled by ML benchmark performance, held-out test sets, cross-cell-line generalization studies, and representation-quality analyses. The second is settled by clinical trials, regulatory review, and gold-standard comparisons in patient populations. These instruments are not interchangeable, and the size of the gap between them is exactly what makes drug discovery slow and expensive even when AI predictions look good in silico.

Reading this gap honestly is essential to reading any “AI cures cancer” headline correctly. A multimodal world model that predicts cellular response to a molecule with 92% accuracy on a benchmark has done a real and useful technical thing. It has not, by that fact alone, demonstrated that any drug it suggests will work in humans. The trip from benchmark performance to clinical utility runs through experiments and trials that the model’s accuracy on a benchmark does not shortcut.

This is not an indictment of the technique. It is the same shape of discipline that any scientific application of AI demands: the model’s evaluation lives in its evaluation frame; the deployment-and-use questions live in different frames with different instruments.

Where this lesson stops, and what is a separate conversation

The same operational scope test applies, with medical-AI-specific categories. The discriminating instrument-based test from L6 and L7 carries forward verbatim: what instruments would you use to settle the question? If model benchmarks, training-loss analysis, representation-quality studies, or generalization tests settle it, it is technique and in scope. If the question needs clinical trials, regulatory review, malpractice frameworks, or patient consent processes to settle it, it lives in a different conversation evaluated by different methods.
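The scope test above is simple enough to write down as a lookup. The sketch below is only a mnemonic, not a tool: the function name `scope_of` and the exact instrument strings are hypothetical, chosen to mirror the two lists in the paragraph above.

```python
# Instruments that settle technique questions (in scope for this lesson).
TECHNIQUE_INSTRUMENTS = {
    "model benchmark",
    "training-loss analysis",
    "representation-quality study",
    "generalization test",
}

# Instruments that belong to the clinical, legal, and ethical conversation.
CLINICAL_INSTRUMENTS = {
    "clinical trial",
    "regulatory review",
    "malpractice framework",
    "patient consent process",
}


def scope_of(instruments_needed):
    """Classify a question by the instruments needed to settle it."""
    needed = set(instruments_needed)
    if not needed:
        raise ValueError("name at least one instrument")
    unknown = needed - TECHNIQUE_INSTRUMENTS - CLINICAL_INSTRUMENTS
    if unknown:
        raise ValueError(f"unrecognized instruments: {sorted(unknown)}")
    # A single clinical-grade instrument moves the whole question out of scope.
    if needed & CLINICAL_INSTRUMENTS:
        return "separate conversation"
    return "technique (in scope)"


print(scope_of(["model benchmark", "generalization test"]))
print(scope_of(["model benchmark", "clinical trial"]))
```

The design choice worth noticing is the asymmetry: a question that needs even one clinical-grade instrument is out of scope, no matter how many benchmarks also bear on it.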

Six categories specifically out of scope for this lesson (and their stakeholders / instruments):

Diagnostic claims and clinical validity. “Does this model produce diagnoses that doctors can act on?” is a clinical-trial and regulatory question, not a benchmark question. Stakeholders: clinical investigators, regulatory bodies, professional medical societies.
Regulatory framework. FDA, EMA, and sectoral medical regulators each have their own approval pathways for software-as-medical-device, AI-assisted diagnostics, and AI-driven drug discovery. Out of scope for the architecture lesson; an entire field in its own right.
Medical malpractice and standard-of-care implications. What does it mean for a clinician to rely on (or override) an AI system, and who is liable when things go wrong? Legal and institutional, not technical.
Patient consent for AI involvement in care. Informed-consent ethics and policy. Bioethics literature and patient-advocacy stakeholders.
Clinical trial methodology vs ML evaluation methodology. Two distinct epistemic frameworks: ML benchmarks measure model performance on held-out test data; clinical trials measure intervention effect on patient outcomes with controlled enrollment, blinding, and statistical analysis built for medicine. The translation between the two is itself a research field (translational science).
Therapeutic claims. What the model can predict is a technical statement; what physicians should prescribe is a clinical-practice judgment. The two are different kinds of claim, settled by different instruments.

The pitfall worth naming explicitly (the one this lesson exists in part to forestall): “the model passes an ML benchmark, therefore the drug it identified will work in patients.” This conflation reappears regularly in the medical-AI literature and the press. It carries a benchmark result across a category boundary that only clinical-trial-grade instruments can justify crossing. The model has done its technical work; the medical claim is a separate, much harder thing.

Why this matters when you use AI

When you read about AI advances in drug discovery, cancer diagnosis, or other medical applications, the right reflex is the operational scope test on whatever claim is in front of you. “Predicts cellular response to a molecule with X accuracy on a benchmark” is an ML benchmark claim and should be evaluated as one. “Improves outcomes in a randomized clinical trial of N patients” is a clinical claim and should be evaluated as one. They are not the same claim and they are not on the same epistemic ladder. Knowing the difference is what keeps you from being either dismissive of real progress or credulous about overreaching headlines.

Common pitfalls

“Model passes ML benchmark, therefore drug is clinically useful.” The single most important pitfall in this lesson. Benchmark performance does not establish clinical utility; the gap is enormous and the instruments that close it are clinical trials, not training-loss curves.
“Bigger model fixes drug discovery.” Biology is data-limited, not capacity-limited; a bigger model trained on the same small dataset tends to overfit more, not less. The bottleneck is the data the world has produced, not the parameters you can throw at it.
“Multimodal models handle all biological data uniformly.” Heterogeneity is real; some biological data has dynamics that need specialized representations (time-series, graph-structured molecules) that a generic multimodal transformer handles poorly. Architecture choices matter.
“World models replace experimentation.” No. They guide which experiments to run (compound screening, targeted assays) and which not to. They do not replace the wet-lab loop; they make it more efficient.
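The “Bigger model fixes drug discovery” pitfall above can be illustrated with a deliberately tiny toy. Everything in it is an assumption made for illustration: polynomial degree stands in for model capacity, a noisy sine curve stands in for a biological signal, and twelve points stand in for a small experimental dataset. It says nothing quantitative about real drug discovery models.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_signal(x):
    """Hypothetical underlying signal the experiments are sampling."""
    return np.sin(3.0 * x)


# A small, noisy training set: the data-limited regime.
x_train = np.linspace(-1.0, 1.0, 12)
y_train = true_signal(x_train) + rng.normal(scale=0.3, size=x_train.shape)

# Noise-free held-out points from the same range.
x_test = np.linspace(-0.95, 0.95, 200)
y_test = true_signal(x_test)


def fit_and_score(degree):
    """Least-squares polynomial fit; returns (train MSE, held-out MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    return train_mse, test_mse


small_train, small_test = fit_and_score(3)   # low-capacity model
big_train, big_test = fit_and_score(10)      # high-capacity model, same data

print(f"degree 3:  train={small_train:.4f}  held-out={small_test:.4f}")
print(f"degree 10: train={big_train:.4f}  held-out={big_test:.4f}")
```

The higher-degree fit can never have a higher training error than the lower-degree one, because it contains the lower-degree model as a special case. Whether its held-out error is worse depends on the noise draw, which is the point: more capacity fits the data you already have more closely, and that alone says nothing about the data you do not have.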

What you should remember

Biology is data-limited and heterogeneous, which makes the scaling-laws story of internet-scale text and image generation a poor fit. You cannot solve a drug discovery problem by training a bigger model on more clinical text.
The multimodal world model framing applies anyway: encode each modality into a shared representation, train on co-occurring biological data, predict perturbation effects across the system. Same pattern as the rest of this track, biology-specific data.
Model benchmark performance is not clinical utility. The two are different claims evaluated by different instruments; the gap between them is exactly what makes drug discovery hard even when AI looks good in silico.
The operational scope test cuts cleanly here: if an ML benchmark settles the question, it is technique; if a clinical trial is required, it is in a different conversation.

The next lesson stays in production-multimodal territory but returns to consumer-product land: what changes when multimodal models live inside shipping products, where RL co-design with the product is the central engineering question, not the architecture.