Summary: Multimodal world models for science

Biology is data-limited and heterogeneous (many modalities, none individually internet-scale), so the multimodal world model framing from L7 (encode each modality, train on co-occurrences, predict perturbation effects) is the right approach. The single most important discipline this lesson teaches: model performance on a biological benchmark is not the same as the model being medically useful. Settling the second claim requires clinical-trial-grade instruments that benchmark performance does not provide. This summary is the scan version of the full lesson.

  • Biological data is harder to work with than internet data. There are many modalities (molecular structures, microscopy, transcriptomics, proteomics, phenotype, clinical text), each much smaller than internet scale. Data is costly and noisy, and the ground truth you most want to predict (clinical effect) is the one you have the least data for. Scaling-law stories transfer poorly.
  • The multimodal world model framing applies anyway. Encode each modality into a shared embedding; train a multimodal transformer on co-occurring biological data; predict perturbation outcomes. Same pattern as L3 / L7, applied to biology. Noetik.ai’s OCTO and Perturb-map are public examples.
  • The capacity-on-semantic-structure argument from L7 matters even more here. When data is precious, you cannot afford to spend model capacity rendering surface details (exact noise, exact small variations). Predicting semantic biological state is exactly the right level of abstraction for the downstream questions.
  • The sharp benchmark-vs-clinical line. A model performing well on a biological benchmark is not the same thing as the model being medically useful. The two claims are settled by entirely different instruments; the gap between them is exactly what makes drug discovery slow even with strong in-silico predictions.
  • The operational scope test specialized to medical AI: what instruments would you use to settle the question? If ML benchmarks settle it, the question is about technique and is in scope. If clinical trials, regulatory review, malpractice frameworks, or patient-consent processes are required, the question lives in a different conversation evaluated by different methods.
  • Six medical-AI categories deferred to other forums: diagnostic claims and clinical validity, regulatory framework, medical malpractice and standard-of-care implications, patient consent for AI involvement, clinical-trial methodology vs ML-evaluation methodology, therapeutic claims.
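
The encode / co-occur / predict pattern from the bullets above can be sketched in a few lines. This is a toy illustration, not the architecture of any system named in this lesson: the modality sizes, `SHARED_DIM`, `PERT_DIM`, and the function names are all hypothetical, the weights are random and untrained, and mean pooling plus a single tanh layer stand in for the multimodal transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

SHARED_DIM = 16  # hypothetical size of the shared embedding
PERT_DIM = 4     # hypothetical number of distinct perturbations

# Hypothetical raw feature sizes per modality (illustrative only).
MODALITY_DIMS = {"transcriptomics": 200, "microscopy": 64, "proteomics": 50}

# One linear encoder per modality, projecting raw features into the shared space.
encoders = {
    name: rng.normal(scale=dim ** -0.5, size=(dim, SHARED_DIM))
    for name, dim in MODALITY_DIMS.items()
}


def encode_sample(sample):
    """Encode whichever modalities are present and mean-pool them into one state."""
    tokens = [features @ encoders[name] for name, features in sample.items()]
    return np.mean(tokens, axis=0)


def predict_perturbed_state(state, perturbation, w_state, w_pert):
    """Predict the post-perturbation embedding (semantic state, not raw data)."""
    return np.tanh(state @ w_state + perturbation @ w_pert)


# Untrained weights for the toy prediction head (assumption: a single layer).
w_state = rng.normal(scale=SHARED_DIM ** -0.5, size=(SHARED_DIM, SHARED_DIM))
w_pert = rng.normal(scale=PERT_DIM ** -0.5, size=(PERT_DIM, SHARED_DIM))

# A sample where only two modalities co-occur; proteomics is missing.
sample = {
    "transcriptomics": rng.normal(size=200),
    "microscopy": rng.normal(size=64),
}
state = encode_sample(sample)

perturbation = np.zeros(PERT_DIM)
perturbation[2] = 1.0  # one-hot code for a hypothetical perturbation
predicted = predict_perturbed_state(state, perturbation, w_state, w_pert)
```

Note that the prediction target is an embedding rather than raw pixels or counts, which mirrors the capacity-on-semantic-structure point: the model is asked for biological state, not for surface noise. The sketch also tolerates missing modalities, since real samples rarely carry every measurement.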

When you read about AI advances in drug discovery, cancer diagnosis, or medical applications, the right reflex is the operational scope test on whatever claim is in front of you. ML benchmark claims (AUC, accuracy, correlation, F1 against held-out data) are evaluated by ML standards; clinical claims (patient outcomes in controlled trials, regulatory approvals, survival improvements) are evaluated by clinical standards. The conflation “model passes ML benchmark, therefore drug works in patients” is the standard medical-AI overreach, and naming it explicitly is the load-bearing discipline of this lesson. The pattern of categories + methods + operational test from Phase 3 transfers cleanly to the medical-AI watch zone with category-specialization (ML evaluation in, clinical trials out). The next lesson stays in production-multimodal territory but returns to consumer-product land: what changes when multimodal models live inside shipping products, where RL co-design with the product is the central engineering question.
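
As a study aid, the operational scope test can be written as a small lookup. This is a minimal sketch under stated assumptions: the instrument names are taken from this summary, while the `Forum` enum, the two sets, and `scope_test` itself are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Forum(Enum):
    ML_EVALUATION = "in scope: settled by ML benchmarks"
    CLINICAL = "out of scope: settled by clinical or regulatory processes"


# Instruments named in this summary; the grouping into sets is illustrative.
ML_INSTRUMENTS = {"auc", "accuracy", "correlation", "f1"}
CLINICAL_INSTRUMENTS = {
    "clinical trial",
    "regulatory review",
    "malpractice framework",
    "patient consent process",
}


def scope_test(required_instruments):
    """Classify a claim by the instruments needed to settle it."""
    required = {name.strip().lower() for name in required_instruments}
    unknown = required - ML_INSTRUMENTS - CLINICAL_INSTRUMENTS
    if not required or unknown:
        raise ValueError(f"cannot classify instruments: {sorted(unknown)}")
    # Any clinical instrument moves the whole claim to the clinical forum.
    if required & CLINICAL_INSTRUMENTS:
        return Forum.CLINICAL
    return Forum.ML_EVALUATION


benchmark_claim = scope_test(["AUC", "F1"])
efficacy_claim = scope_test(["AUC", "clinical trial"])
```

The design choice worth noticing is the asymmetry: a claim that needs even one clinical instrument is a clinical claim, no matter how many ML metrics accompany it. That is the "model passes ML benchmark, therefore drug works in patients" conflation made explicit.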