Weekly Tech+Bio Highlights #44: New 9.8 Billion Protein Dataset Built for Generative Biology

Rethinking IP in the Age of AI Protein Design, Compressing Medical Images With Clinical Signal Intact, & Where AI in Drug R&D Is (and Isn’t) Gaining Ground

Jun 16, 2025

Basecamp Research announced one of the largest protein datasets to date, sourced from over one million newly sampled species collected across extreme and biodiverse environments worldwide; Recursion downsized its team; and NIH opened the floor for public input on how AI should shape U.S. healthcare. Meanwhile, Sofinnova and NVIDIA teamed up to supercharge bio-AI startups, the 23andMe story continued, and European funders were urged to seize a rare chance to attract U.S. research talent amid deep NIH cuts.

New foundation models rolled out across small molecule design, protein generation, and simulated cell responses. Agentic systems are beginning to show up in real R&D tasks, from querying omics to automating trial operations. Several of these launches coincided with conference showcases and infrastructure partnerships.

The highlights above are a quick summary—hyperlinked stories with more detail are organized in the sections below.

📯 The London Biotechnology Show returns this week on June 18-19, featuring themes in AI-driven bioprocessing, cell and gene therapy, and digital infrastructure—stay tuned for our post-event coverage!

Hi! This is BiopharmaTrend’s weekly newsletter, Where Tech Meets Bio, where we explore technologies, breakthroughs, and cutting-edge companies.


🤖 AI x Bio

(AI applications in drug discovery, biotech, and healthcare)

🔹 Basecamp Research unveils BaseData, a 9.8B-protein dataset built from samples across 26 countries—expanding the known tree of life 10x and breaking the “data wall” (whitepaper) stalling AI in biology; now training foundation models with NVIDIA to unlock new therapeutic and sustainability insights; read more below.

🔹 Evogene unveils a first-in-class generative AI foundation model for small molecule design, developed with Google Cloud, achieving ~90% precision in generating novel, patentable compounds for pharma and agri-tech.

🔹 Recursion trims workforce: the AI-driven biotech lays off 20% of staff following the cut of several clinical programs, aiming to extend its $500M cash runway into 2027 and refocus on oncology and rare disease R&D after its Exscientia merger. In a STAT op-ed, CEO Chris Gibson reflects on layoffs amid biotech's "rainy season"; the company also appoints Lina Nilsson as Chief Platform Officer.

🔹 IQVIA and NVIDIA launch AI orchestrator agents to streamline clinical trials and drug commercialization, using agentic systems to automate data extraction, trial start-up, and sales insights.

🔹 Stanford and Microsoft trial shows AI as a diagnostic “teammate” boosts physician accuracy—in a new RCT, clinicians collaborating with GPT-based AI achieved up to 85% diagnostic accuracy. (Eric Horvitz)

🔹 Tahoe Therapeutics (formerly Vevo) debuts an AI agent for querying massive omics datasets: built with Kepler AI (Keplogic), it lets biologists explore the 100M-cell Tahoe-100M dataset using natural language, generating code, visualizations, and literature-aware insights. (Nima Alidoust)

🔹 Stanford’s CellVoyager is an AI agent that autonomously generates and tests hypotheses from single-cell RNA-seq data—using a dual-loop setup with an LLM planner and vision-language model interpreter, outperforming GPT-4o and o3-mini by up to 20% on the CellBench benchmark.

🔹 Custom-designed proteins boost implant integration and healing—researchers at the Institute for Protein Design (led by Xinru Wang and Jordi Guillem-Marti) developed NeoNectins, synthetic proteins that bind integrin α5β1 to enhance tissue regeneration and implant adaptation.

🔹 Stanford pilots AI assistant for medical records—ChatEHR allows clinicians to interact with electronic health records through natural language.

🔹 Sofinnova-backed startups get NVIDIA cloud GPUs to scale bio-AI—as part of a new partnership, Sofinnova portfolio companies (Bioptimus, Cure51, BioCorteX, and Latent Labs) gain access to NVIDIA DGX Cloud Lepton, accelerating large-scale biological AI models for tasks like spatial biology, protein design, and microbiome-drug interaction modeling.

🔹 Phare Bio, a Cambridge-based AI-driven antibiotic discovery venture, joins the Google.org Generative AI Accelerator to expand its open-access platform, which uses generative models and biological screening to design new antibiotic candidates against drug-resistant infections.

🔹 BostonGene’s AI-driven multiomic platform uncovered therapeutic targets for invasive lobular carcinoma, earning the GRASP Advocate Choice Award at ASCO 2025 for identifying subtype-specific genomic alterations that could guide precision drug development.

🔹 Free AI comparison tool for clinicians—Ruslan Nazarenko debuts HealthOlymp, offering doctors no-cost access to the latest medical AI models with side-by-side output comparison and voting, aiming to reduce clinical burden and promote transparency in AI safety and accuracy.

🔹 Diffuse Bio launches DSG2-mini and DiffuseSandbox to democratize AI-driven protein design—the startup unveiled a new compute-efficient model for nanobody design, DSG2-mini, now accessible via DiffuseSandbox, a user-friendly web app that lets researchers design protein binders without coding skills.

🔹 Fujifilm and Ibex integrate AI into SYNAPSE Pathology—the partnership launches at North Bristol NHS to speed up prostate, breast, and gastric cancer diagnosis.

🔹 Tempus and Northwestern University partner on AI-driven Alzheimer’s research—Tempus will apply its Lens AI platform to analyze genomic data from the Abrams Research Center on Neurogenomics.

🔹 Moon Surgical brings Physical AI to the OR—at GTC Paris, Moon Surgical showcased its AI-robotics platform powered by NVIDIA edge computing, now used in 1,500+ procedures to support real-time decision-making and team coordination in minimally invasive surgery.

🔹 Peptone and NVIDIA unveil AI model to decode disordered proteins at GTC Paris—PepTron-o is an ensemble-based structure predictor for intrinsically disordered regions (IDRs), a major blind spot in protein modeling.

🔹 AI generates lab-like gene expression profiles from scratch—Jeff Leek showcases new results from Synthesize Bio's foundation model that simulates cellular responses to interferon-alpha, even in unseen cell lines. This echoes scGen, a 2019 model by Mo Lotfollahi (Sanger Institute) et al. that first demonstrated out-of-sample perturbation prediction using variational autoencoders.

🔹 Fable Therapeutics appoints ex-AstraZeneca exec David J. Baker, PhD, as CSO to lead its AI-driven protein design platform for obesity and MASH, using structure- and sequence-based language models.

🚜 Market Movers

(News from established pharma and tech giants)

🔹 What’s going on with bispecific antibodies?—Byron Fitzgerald highlights an uptick in bispecific antibody (BsAb) development, as big pharma bets billions (e.g., BMS-BioNTech, Pfizer-3SBio), CDMOs expand global manufacturing, and new trispecifics show promise in solid tumors.

🔹 Novartis targets ageing biology as next drug frontier—reported by Jessica Davis Plüss for swissinfo.ch, Novartis is doubling down on ageing research through AI-powered partnerships like BioAge and a new internal unit exploring targets like exercise biology and muscle preservation.

🔹 23andMe story continues: Anne Wojcicki reclaims the company via a nonprofit. TTAM Research Institute, led by 23andMe’s co-founder, outbid Regeneron with a $305M offer to acquire the bankrupt company’s assets.

💰 Money Flows

(Funding rounds, IPOs, and M&A for startups and smaller companies)

🔹 AstraZeneca signs a $5.3B deal with CSPC Pharmaceutical to co-develop oral drugs for chronic diseases using CSPC’s AI-driven drug discovery platform, with $110M upfront and potential milestones tied to development and sales.

🔹 XtalPi acquires Liverpool ChiroChem (LCC) to integrate its automated chiral chemistry platform into XtalPi’s AI–quantum–robotics stack.

🔹 Caris Life Sciences files for a $400M Nasdaq IPO under ticker CAI, aiming for a valuation over $5.2B as it expands its AI-powered cancer diagnostics, following FDA approval of its MI Cancer Seek test and 31% clinical growth in Q1 2025.

🔹 SpliceBio raises $135M to advance eye gene therapy: backed by Sanofi, Roche, and others, SpliceBio will fund phase 1/2 trials of its AAV therapy for Stargardt disease and expand its pipeline using a protein splicing platform to deliver large genes.

⚙️ Other Tech

(Innovations across quantum computing, BCIs, gene editing, and more)

🔹 Vascularized organoids—Stanford Medicine researchers developed cardiac and hepatic organoids with self-formed blood vessels, overcoming a major size and maturity barrier on a path to more realistic disease models and regenerative therapy.

🔹 Emulate launches AVA, a benchtop Organ-on-a-Chip workstation that cultures and images 96 samples per run—designed to generate structured datasets at scale for downstream AI tools in drug discovery and toxicity prediction.

🔹 Mind-controlled tech enters the Middle East—Neuralink launches its first regional clinical trial in Abu Dhabi, testing brain-computer interfaces to help people with motor and speech impairments control devices via thought.

🔹 Perfused human brains meet AI for CNS drug discovery—Bexorg partners with Biohaven to apply its AI-driven platform using metabolically active postmortem human brains to accelerate two CNS preclinical programs, aiming to improve biomarker discovery, target validation, and drug response prediction.

🏛️ Bioeconomy & Society

(News on centers, regulatory updates, and broader biotech ecosystem developments)

🔹 Europe moves to capture brainpower lost to US funding cuts: chief editor Barbara Cheifet (Nature Biotechnology) reflects on how deep NIH cuts under the current administration create a rare opening for Europe to attract top research talent (see latest $560M package) and invest strategically, if funders act boldly and scale support to match the opportunity.

🔹 NIH seeks input on AI's future in healthcare—the agency is developing its first AI strategic plan and is calling for public comment on how AI should support research, clinical care, and public health.

🔹 Are we evaluating the Virtual Cell all wrong?—in Nature Biotechnology, Hanchen Wang, Jure Leskovec, and Aviv Regev (Stanford, Genentech) argue that common metrics fall short for assessing single-cell embeddings, a key component of virtual cell models.

🔹 UK launches OpenBind to lead global AI-driven drug discovery, backed by £8M from the newly formed Sovereign AI Unit; it aims to generate over 500,000 protein-ligand structures (a 20x increase over all public data from the past 50 years) to train next-gen AI models and cut R&D costs by £100B. Companies including Isomorphic Labs, Astex, Genentech, Chai Discovery, and Genesis Therapeutics join forces with academic leaders from Oxford, Diamond Light Source, Columbia, and University of Washington (David Baker).

🚀 A New Kid on the Block

(Emerging startups with a focus on technology)

🔹 Boston-based Sesen launches to provide AI-assisted translation and localization for clinical trials, labeling, and regulatory submissions, combining its domain-trained SesenGPT model with expert review to support content in over 150 languages.

🔹 Parallel Bio raises $21M to scale organoid-based human immune modeling for drug discovery—backed by AIX Ventures and Salesforce’s Marc Benioff, Parallel Bio has closed a $21M Series A to expand its AI-powered platform that replaces animal testing with 3D human immune organoids. The company (co-founded by Robert DiFazio and Juliana Hilliard) already has 8 pharma partners (including 3 Fortune 500s) and just completed a preclinical study with Centivax validating a universal flu vaccine in human-derived immune organoids.

🔹 Coherence Neuro emerges with a discreet brain-computer interface targeting cancer's electrical signaling—the startup (formerly opto.bio) is pioneering "electro-oncology" with SOMA-1, an MRI-transparent implant that monitors and modulates cancer-related nerve activity, with a first-in-human trial for glioma planned in October (Ben Woodington).


Building a Biological Dataset for the AI Era

Basecamp Research put out a whitepaper detailing BaseData, which it says is the largest sequence dataset purpose-built for biological model training: around 9.8 billion protein sequences from over a million newly sampled species. According to Basecamp, that is roughly a 10x increase in protein diversity over all public databases combined.

The data was collected through a global network of 125+ community partnerships in 26 countries, including samples from shipwrecks, volcanoes, and Antarctic soils. Highlights include a hydrogen-powered bacterium from Antarctica, a metal-tolerant Burkholderia from a WWII shipwreck, and a thermoacidophilic Sulfolobaceae relative from volcanic hot springs.

Source: Preprint “Breaking Through Biology’s Data Wall: Expanding the Known Tree of Life by Over 10x using a Global Biodiscovery Pipeline”; Basecamp Research

To standardize sampling, Basecamp developed mobile molecular biology tools for real-time, in-field DNA extraction and analysis. BaseData is designed with context-aware modeling in mind: rather than treating genes in isolation, the dataset retains longer genomic context windows (up to 10,000+ base pairs) to enable richer training signals for foundation models.

The data wall, and the plateau

According to the company, most biological models today still train on clustered versions of public datasets like UniRef50—resources that capture only a fraction of known biological diversity and were never designed for machine learning. Analyses show that over two-thirds of SRA data comes from just five species, and once redundancy is removed, large pretraining corpora like BFD and PPA-1 expand protein diversity by less than 4x.

This bottleneck mirrors what’s been observed in other domains like NLP and vision: once models outgrow the available high-quality data, performance plateaus. In biology, this is now evident in benchmarks like ProteinGym, where model accuracy in zero-shot tasks saturates despite continued increases in parameter count.

Basecamp frames this slowdown as a systemic limitation of how the data is structured, not just how much of it exists: overrepresented taxa, sparse annotations, and fragmented repositories constrain model generalization across the tree of life.

The preprint describing the dataset is available on the company’s site. Early access is being offered to researchers, and a follow-up analysis is planned to assess how broader sequence diversity impacts AI model performance. The work hasn’t been peer-reviewed yet.

Rethinking IP in the Age of AI Protein Design

Deniz Kavi (Tamarind Bio) raises a pointed question: what happens to biologics IP when AI can generate functional protein variants with low sequence identity?

Tools like ProteinMPNN, developed by the Baker Lab, can redesign large portions of a protein (like an antibody or enzyme) while retaining its structure and function. Variants with <50% identity to the original have been shown to fold and perform similarly. Related tools like IgDesign and AntiFold specifically target antibody design by substituting CDRs while maintaining overall geometry.

This poses a challenge to IP strategies that hinge on sequence identity thresholds (e.g., ≥80%). In antibodies, Amgen v. Sanofi (2023) tightened patentable claims to exact CDR combinations, potentially leaving biosimilars a few mutations away from commercial viability—even before patent expiry.
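To make the threshold argument concrete, here is a minimal sketch of how percent identity might be computed for two pre-aligned sequences and compared against a claim cutoff. The function names, the toy sequences, and the choice of denominator (aligned columns with no gap in either sequence) are assumptions for illustration; real patent claims and alignment tools define identity in varying ways.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumes both inputs come from the same pairwise alignment, with '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns (one of several possible conventions)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def within_identity_claim(aligned_a: str, aligned_b: str, threshold: float = 80.0) -> bool:
    """True if the variant falls inside a hypothetical '>= threshold % identity' claim."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Toy example (made-up sequences): 6 of 10 positions match.
reference = "ACDEFGHIKL"
redesign = "ACDQYGHWRL"
print(percent_identity(reference, redesign))       # 60.0
print(within_identity_claim(reference, redesign))  # False at an 80% cutoff
```

In this toy case, a redesign that keeps 60% of positions would sit outside an 80% identity claim, which is the gap the article describes.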

For enzymes, the situation is more variable (identity cutoffs range from 50% to 95%), but structural and functional data are increasingly needed to defend claims.

Kavi predicts a shift away from sequence-based definitions of novelty. He cites Brian Naughton’s proposal to use structural alignment tools like TM-align, where a TM-score ≥0.8 suggests shared topology and likely non-novelty. Kavi also emphasizes that while discovery may become cheaper through AI, clinical trial costs remain unchanged—and will continue to dominate total development expenses.
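For readers unfamiliar with the metric, the sketch below shows the standard TM-score formula evaluated for a single, already-superimposed alignment. It is a simplification: TM-align itself searches over superpositions to maximize the score, which this code does not do. The function names, the 0.5 Å floor on d0 for short chains, and the example distances are assumptions for illustration.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale: d0 = 1.24 * (L - 15)^(1/3) - 1.8.

    The 0.5 floor for short chains is an assumption of this sketch.
    """
    raw = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8 if target_length > 15 else 0.0
    return max(raw, 0.5)


def tm_score(distances, target_length: int) -> float:
    """TM-score for one fixed superposition.

    distances: per-aligned-residue C-alpha distances in angstroms.
    target_length: length of the reference protein used for normalization.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def likely_same_topology(score: float, threshold: float = 0.8) -> bool:
    """Apply the TM-score >= 0.8 cutoff mentioned in the text."""
    return score >= threshold


# Hypothetical case: 100-residue reference, 95 aligned residues each 1.0 angstrom apart.
score = tm_score([1.0] * 95, target_length=100)
print(round(score, 3), likely_same_topology(score))
```

Because the score is normalized by length and weights each residue by distance, two proteins with very different sequences can still score high if their backbones overlap, which is why it is proposed as a structure-based alternative to identity cutoffs.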

Robert A. Bohrer argues that AI-driven enumeration may revive broad, epitope-based patent strategies, where claims cover all antibodies that bind in a defined way—not just the ones that have been physically produced.

The main risk: sequence identity may no longer be a defensible proxy for novelty. For biosimilars, that opens opportunity. For enzyme patents based solely on sequence, it may mean increased exposure to AI-enabled workarounds.

Compressing Medical Images With Clinical Signal Intact

A team at Stanford has released MedVAE, a family of six large-scale 2D and 3D autoencoders trained on over one million medical images. The core idea is simple: encode high-resolution images into lower-dimensional latent spaces that still preserve the clinical signal, then decode them back when needed.

This setup targets a growing bottleneck in medical AI: how to train or deploy models on ever-larger imaging datasets without prohibitive storage or compute costs.

Overview of the MedVAE training and evaluation workflow, including two-stage model training and downstream assessment of latent representations and reconstructions for clinical relevance and efficiency. Source: Varma et al., "MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders", arxiv.org

Why this matters

High-res medical images (like 1024x1024 X-rays or 256x256x256 CTs) are standard because clinical features can be subtle. But this makes storage and model training computationally expensive, especially in 3D. Many groups downsample with interpolation, but that often degrades performance. MedVAE suggests an alternative: train domain-specific autoencoders that learn to compress and reconstruct while preserving diagnostic content.

The setup

MedVAE uses a two-stage training scheme:

Stage 1 trains base autoencoders on image reconstruction, using perceptual loss, adversarial objectives, and embedding consistency.
Stage 2 refines the latent space to better retain clinical features across modalities, using a CLIP-style alignment with BioMedCLIP embeddings (for 2D) or full 3D model fine-tuning (for volumetric data).

The model family includes four 2D and two 3D variants, with downsizing factors up to 512x.
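As a rough illustration of where a figure like 512x can come from, the sketch below computes the latent grid size and element-count reduction for a given per-axis stride. The per-axis stride of 8 and the single-channel latent are assumptions chosen for illustration, not details confirmed by the summary above; the function names are hypothetical.

```python
from math import prod


def latent_shape(image_shape, stride: int):
    """Spatial shape of the latent grid when every axis is reduced by `stride`."""
    if any(dim % stride for dim in image_shape):
        raise ValueError("each image dimension must be divisible by the stride")
    return tuple(dim // stride for dim in image_shape)


def downsizing_factor(image_shape, stride: int) -> int:
    """Ratio of input elements to latent elements (latent channels ignored)."""
    return prod(image_shape) // prod(latent_shape(image_shape, stride))


ct_volume = (256, 256, 256)  # example 3D size from the text
xray = (1024, 1024)          # example 2D size from the text

print(latent_shape(ct_volume, 8), downsizing_factor(ct_volume, 8))  # (32, 32, 32) 512
print(latent_shape(xray, 8), downsizing_factor(xray, 8))            # (128, 128) 64
```

The same per-axis reduction compounds much faster in 3D than in 2D (8 cubed versus 8 squared), which helps explain why the largest savings are reported for volumetric models.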

Results

Latent representations can replace high-res images in CAD pipelines with minimal or no drop in performance. In some cases, classifiers trained on MedVAE latents outperformed those trained on the original images.
Efficiency gains were substantial: up to 70x faster throughput and 512x storage savings for 3D models.
Reconstructions preserved clinically relevant features well, according to both automated metrics (PSNR, MS-SSIM) and expert radiologist evaluations.
MedVAE generalized across modalities (X-rays, MRIs, CTs) and anatomies, even to unseen types like wrist fractures.
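PSNR, one of the automated reconstruction metrics named above, is simple enough to sketch directly. This is a generic, minimal version operating on flat lists of pixel intensities; it is not the evaluation code used in the paper, and the 8-bit `max_value` default and the toy pixel values are assumptions.

```python
import math


def psnr(original, reconstructed, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE).

    original, reconstructed: equal-length flat sequences of pixel intensities.
    """
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel example: two pixels are off by 10 intensity levels.
print(round(psnr([100, 100, 100, 100], [110, 90, 100, 100]), 2))  # 31.14
```

Higher PSNR means the reconstruction is closer to the original pixel-for-pixel, but it does not by itself say whether clinically relevant detail survived, which is presumably why the authors pair it with radiologist review.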

Compared to natural image autoencoders (KL-VAE, VQ-GAN), MedVAE performed consistently better on both reconstruction and classification metrics.

The open-source code is available on GitHub.

2025 Market Map: Where AI in Drug R&D Is (and Isn’t) Gaining Ground

CB Insights’ spring market map tracked 225 companies applying AI across the drug development pipeline, from target discovery through clinical trials. While the overall picture is one of expansion, the pace and maturity vary widely by phase.

Clinical development tools are the furthest along: 37% of companies in this segment are already scaling or commercially established, compared to just 7% in preclinical. This likely reflects the fact that clinical-stage tools can plug into existing healthcare infrastructure, while earlier stages involve more speculative science and heavier regulatory lift.
Funding patterns follow suit: In 2024, AI drug R&D funding rebounded to $3.8B (up from $3B in 2023). Discovery engines (platforms generating therapeutic assets using proprietary AI) drew the largest share. Notable raises included Isomorphic Labs ($600M Series A), Enveda ($130M Series C + $20M from Sanofi), and several deals in quantum-enabled design and trial management.
Operational tools are attracting attention: Patient recruitment platforms and trial management systems showed strong year-over-year deal growth and scored highest on CB Insights’ Mosaic scores for private company health. Paradigm Health and Lindus Health stood out here, landing large rounds and partnering with institutions like Japan’s National Cancer Center and CDISC.

Preclinical AI tools remain earlier-stage—81% of funding since 2023 went to early-stage companies—and offer less evidence of commercial traction for now. But they may still represent early positioning opportunities for investors willing to take on more technical risk.

In short: AI is becoming more embedded in drug R&D workflows, but so far the commercial momentum is clearest where the tech helps clean up the operational mess—especially in clinical execution and infrastructure.

On that note, 2024 was a big year for clinical activity in general. According to Citeline data, nearly 5,000 industry-sponsored trials hit completion or primary endpoints—a 14.2% jump over 2023 and the largest annual increase in nearly a decade.
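The year-over-year figures quoted in this section can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers stated above; the 5,000 trial count is treated as an upper bound, since the text says "nearly 5,000", so the implied 2023 baseline is approximate. Function names are illustrative.

```python
def pct_change(previous: float, current: float) -> float:
    """Percentage change from `previous` to `current`."""
    return (current - previous) / previous * 100.0


def implied_baseline(current: float, pct_increase: float) -> float:
    """Previous-period value implied by a current value and its percentage increase."""
    return current / (1.0 + pct_increase / 100.0)


# AI drug R&D funding: $3B (2023) to $3.8B (2024)
print(round(pct_change(3.0, 3.8), 1))         # 26.7 (percent)

# Trials: ~5,000 in 2024 after a 14.2% jump implies roughly this many in 2023
print(round(implied_baseline(5000, 14.2)))    # 4378
```

So the funding rebound works out to a bit under 27%, and the trial figure implies a 2023 baseline of roughly 4,400 or fewer.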

🔗 For a quick visual breakdown, check out Maryam Daneshpour’s graphic built from the report data.

Cover: Basecamp Research in Costa Rica, GlobeNewswire media kit.
