Building the Virtual Cell: AI Foundation Models and Billion-Cell Datasets

From its early-2000s origins and a 2012 milestone, through a decade-long lull, virtual cell modeling is resurging with transformers, perturbation atlases, and lab-in-the-loop systems

Aug 07, 2025

The term “virtual cell” traces back to the early 2000s and M. Tomita's Trends in Biotechnology article “Whole-cell simulation: a grand challenge of the 21st century,” but mainstream attention to full-cell simulation began with the 2012 Stanford Mycoplasma model. After a long lull, interest in modeling whole cells is resurging, this time driven by advances in machine learning, multimodal omics, and large-scale data integration.

Cells are central to understanding health, aging, and disease. They are also the primary testbed for applications in drug development and synthetic biology. However, cell-based experiments are costly and highly variable, raising concerns about reproducibility in biomedical research.

The ambition behind virtual cells is to build in silico models that can simulate, predict, and manipulate cellular behavior, reducing reliance on physical experimentation. A functioning virtual cell could accelerate hypothesis testing, guide therapeutic development, identify causal mechanisms of disease, and enable scalable experimentation—particularly in cases where direct observation is impractical or impossible.

As these models evolve, one long-term practical goal is to develop patient-specific “virtual twins”: dynamic, data-integrated simulations of individual cellular systems that can forecast treatment responses and inform personalized interventions.

In this article: The outline of a modern Virtual Cell — Pre-AIVC era — The AIVC — Big Tech Enters — No Data-No Party — Reality Check — The Holy Grail of Digital Biology

To investigate a cell’s activity, researchers have tried to build virtual cell models that can predict, simulate, and direct the cell's behaviour. Early versions of the virtual, or digital, cell relied on traditional, low-throughput biochemical assays to measure changes in substances over time and space during specific biological processes. These early models used differential equations and stochastic simulations to describe particular cellular functions.
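To make the distinction between the two classical approaches concrete, here is a minimal sketch of a single cellular function, mRNA synthesis and degradation, modeled both ways. It is an illustration under stated assumptions rather than a reconstruction of any published model: the rate constants, function names, and the choice of a one-species birth-death process are all hypothetical. The deterministic version integrates dm/dt = k_syn - k_deg * m with forward Euler; the stochastic version uses Gillespie's direct method on the same two reactions.

```python
import random

def simulate_ode(k_syn, k_deg, m0, t_end, dt=0.01):
    """Deterministic model dm/dt = k_syn - k_deg * m, integrated with forward Euler."""
    m = float(m0)
    for _ in range(int(round(t_end / dt))):
        m += (k_syn - k_deg * m) * dt
    return m

def simulate_gillespie(k_syn, k_deg, m0, t_end, rng):
    """Stochastic version of the same model (Gillespie's direct method).

    Returns the integer molecule count at time t_end for one simulated cell.
    """
    m, t = int(m0), 0.0
    while True:
        rate_syn = k_syn          # synthesis: 0 -> mRNA
        rate_deg = k_deg * m      # degradation: mRNA -> 0
        total = rate_syn + rate_deg
        t += rng.expovariate(total)   # waiting time to the next reaction
        if t > t_end:
            return m
        # Pick which reaction fires, in proportion to its rate.
        if rng.random() * total < rate_syn:
            m += 1
        else:
            m -= 1

if __name__ == "__main__":
    # Hypothetical rates (arbitrary units); the steady state is k_syn / k_deg.
    k_syn, k_deg = 10.0, 0.5
    rng = random.Random(0)
    print("ODE result:", simulate_ode(k_syn, k_deg, 0, 20.0))
    print("Five stochastic cells:",
          [simulate_gillespie(k_syn, k_deg, 0, 20.0, rng) for _ in range(5)])
```

The deterministic run gives one smooth average trajectory, while each stochastic run gives a different integer count, which is the cell-to-cell variability that differential equations alone do not capture.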

A landmark effort was the whole-cell model of Mycoplasma genitalium, released in 2012 by Markus Covert’s group at Stanford. It included all 525 known genes and their molecular functions known a priori. Although a breakthrough at the time, mechanistic modeling approaches struggle to capture complex cellular behaviors accurately because of multi-scale interactions spanning atomic to cellular levels, interplay among diverse biomolecular processes, and highly nonlinear dynamics. These factors have made it hard to build complete and reliable virtual cell models.

The outline of a modern Virtual Cell

Recent advances in AI and omics have led to the concept of the AI virtual cell (AIVC), introduced in late 2024 in the Cell perspective article "How to build the virtual cell with artificial intelligence: Priorities and opportunities," authored by a large interdisciplinary team spanning EPFL, Stanford, the Chan Zuckerberg Initiative, Genentech, Google Research, Harvard, and others, with Charlotte Bunne as a co-first author.

Framed as foundation models for cell biology, AIVCs are designed to learn generalizable, high-dimensional representations from diverse cellular data—enabling transfer across cell types, modalities, and experimental tasks.

The authors define the AIVC as “a comprehensive AI framework composed of several interconnected foundation models that represent dynamic biological systems at increasingly complex levels of organization—from molecules to cells, tissues, and beyond.”

Capabilities of AIVC. From “How to build the virtual cell with artificial intelligence: Priorities and opportunities” License: CC-BY-4.0

The AIVC has the following capabilities:

Universal Representations (URs). URs are the core elements of the AIVC, enabling it to encode biological states across molecular to multicellular scales and across species. They form a shared embedding space where diverse data types and modalities are integrated cohesively. Crucially, URs generalize beyond training data, allowing the AIVC to infer novel biological states and serve as a robust reference framework for analysis and discovery.
Prediction of Cell Behavior. The AIVC models how cells behave under natural and perturbed conditions—genetic, chemical, or environmental. It can forecast cell state transitions over time, even in untested scenarios. Beyond prediction, it simulates interventions to uncover causal mechanisms behind phenotypes, offering deep insights into cellular functions.
In Silico Experimentation. With virtual instruments (VIs), the AIVC conducts advanced in silico experiments that replicate lab procedures. These simulations guide wet-lab priorities, support synthetic biology design, and model hard-to-measure systems. The platform also uses active learning to identify knowledge gaps and recommend targeted data collection, improving itself through an iterative loop of prediction, experimentation, and refinement.
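The iterative loop of prediction, experimentation, and refinement described in the last item can be sketched in a few lines. This is a toy illustration, not the AIVC's method: the `wet_lab_assay` function is a hypothetical stand-in for a real experiment, the surrogate model is a simple nearest-neighbor lookup, and "uncertainty" is approximated by the distance to the nearest condition already measured. A real system would use a learned model with calibrated uncertainty estimates.

```python
def wet_lab_assay(dose):
    """Hypothetical stand-in for a real experiment: a saturating dose response."""
    return dose / (1.0 + dose)

def predict(measured, dose):
    """Toy surrogate model: return the response at the nearest measured dose."""
    nearest = min(measured, key=lambda d: abs(d - dose))
    return measured[nearest]

def uncertainty(measured, dose):
    """Crude uncertainty proxy: distance to the nearest measured dose."""
    return min(abs(d - dose) for d in measured)

def active_learning_loop(candidates, initial_doses, n_rounds):
    """Repeatedly measure the candidate condition the model knows least about."""
    measured = {d: wet_lab_assay(d) for d in initial_doses}
    for _ in range(n_rounds):
        pool = [d for d in candidates if d not in measured]
        if not pool:
            break
        # Identify the biggest knowledge gap among untested conditions.
        next_dose = max(pool, key=lambda d: uncertainty(measured, d))
        # "Run" the experiment and fold the result back into the model.
        measured[next_dose] = wet_lab_assay(next_dose)
    return measured

def mean_abs_error(measured, candidates):
    """Average gap between surrogate predictions and the simulated assay."""
    errors = [abs(predict(measured, d) - wet_lab_assay(d)) for d in candidates]
    return sum(errors) / len(errors)

if __name__ == "__main__":
    doses = [i * 0.5 for i in range(21)]  # hypothetical candidate doses, 0 to 10
    for rounds in (0, 2, 5):
        data = active_learning_loop(doses, [0.0], rounds)
        print(rounds, "rounds, error:", mean_abs_error(data, doses))
```

Each round spends one simulated experiment where the model is least informed, so prediction error over the candidate set shrinks far faster than it would if the doses were tested in order.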

One long-term vision for the AIVC is to support the development of virtual twins—patient-specific models that simulate cellular behavior under different treatment conditions, potentially enabling personalized intervention planning and predictive diagnostics.
