Addressing Data Bottlenecks in the Era of AI-driven Drug Discovery
An important component of drug discovery "industrialization" is still largely missing. Thankfully, a group of companies is bringing about the needed change.
This week’s Where Tech Meets Bio is sponsored by Syntekabio (KOSDAQ: 226330), a South Korean leader in AI-driven drug discovery.
Syntekabio's STB CLOUD is a fully automated, cloud-based AI platform aimed at accelerating early-stage drug development. Traditionally, this process takes five years or more, but with STB CLOUD, developers can reach the pre-clinical stage in just about two years.
The platform addresses five technical challenges in the AI drug development process: virtual screening, AI deep learning, MD simulation, pocket diversity, and ADME/Tox.
STB CLOUD integrates heterogeneous hardware and hundreds of bio software tools, using Kubernetes and Docker containerization for seamless operation. Its core product, DeepMatcher™-Hit, drives the platform by screening billions of compounds against target proteins and deriving active substances. STB CLOUD minimizes the need for specialized personnel by automating the entire process, allowing users to find small-molecule drug candidates with high binding affinity within several weeks.
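To make the containerized-orchestration idea more concrete, here is a minimal sketch -- not Syntekabio's actual code; the container image, namespace, and resource figures are placeholders -- of how a sharded virtual screening job might be submitted with the official Kubernetes Python client:

```python
# Hypothetical sketch: submitting a sharded virtual screening job to a
# Kubernetes cluster. The image name, namespace, and resource figures
# are placeholders, not Syntekabio internals.
from kubernetes import client, config

config.load_kube_config()  # authenticate using the local kubeconfig

container = client.V1Container(
    name="docking-worker",
    image="registry.example.com/docking:latest",  # placeholder image
    args=["--target", "target.pdb", "--library", "compounds.smi"],
    resources=client.V1ResourceRequirements(
        requests={"cpu": "8", "memory": "16Gi"},  # per-shard resources
    ),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="virtual-screen"),
    spec=client.V1JobSpec(
        completions=100,            # compound library split into 100 shards
        parallelism=20,             # run 20 shards at a time
        completion_mode="Indexed",  # each pod reads JOB_COMPLETION_INDEX
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="screening", body=job)
```

The appeal of this pattern is that each screening shard is just another container, so heterogeneous docking and scoring tools can share the same cluster and scheduler.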
The platform's commercialization will include the addition of NEO-ARS™, a neo-antigen discovery platform, as well as other drug development platforms in the future. By simplifying and automating drug development, STB CLOUD enables innovation and expands access for researchers worldwide.
Read "The Power of STB CLOUD in Remote AI Drug Development".
Ok, now, let’s get to the topic of today’s newsletter.
If you are a biologist or a drug hunter and you haven't read the blog of Dr. Vijay Pande, a general partner at Andreessen Horowitz (a16z) and founding investor of a16z's Bio Fund, you probably should.
Apart from being an influential venture capitalist in the biotech space, with companies like Insitro and BioAge in the fund's portfolio, Dr. Pande regularly shares his vision of how technologies are changing drug discovery and biotech research. According to him, we are in the middle of an industrial revolution in drug discovery and biotech, with artificial intelligence (AI), machine learning, and automation as the driving forces of this change.
Hard to disagree with such a vision.
Indeed, there is a growing wave of companies building a new generation of drug design platforms -- Recursion Pharmaceuticals (NASDAQ: RXRX), Insitro, Exscientia (NASDAQ: EXAI), Insilico Medicine, Deep Genomics, Valo Health, Relay Therapeutics (NASDAQ: RLAY), you name it -- companies that create highly integrated, automated, AI-driven, and data-centric drug design processes from biology modeling and target discovery all the way to lead generation and optimization (sometimes referred to as “end-to-end” platforms). I wrote about this trend in one of the articles in a series about AI in drug discovery: A New Breed of Biotechs Takes the Lead in AI Drug Discovery. These “digital biotechs” are trying to transform traditional drug discovery, a notoriously bespoke, artisanal process, into a more streamlined, repeatable, data-driven one -- more resembling an industrial conveyor line for drug candidates.
Announcements over the last couple of years by Exscientia (NASDAQ: EXAI) (here), Deep Genomics (here), Insilico Medicine (here), and other companies point to a situation where the average time for an entire preclinical program -- from building a disease hypothesis to the official nomination of a preclinical drug candidate -- has shrunk to timelines as short as 11-18 months, at a fraction of the cost of a typical project of similar nature conducted “traditionally”. The most rapid timelines are achieved in drug repurposing programs with previously known drugs or drug candidates, for example, using AI-generated knowledge graphs, as BenevolentAI (AMS: BAI) did in their Baricitinib program, or advanced multiomics analysis and network biology to derive precision biomarkers for better patient stratification and matching to novel indications -- as Lantern Pharma (NASDAQ: LTRN) does to rapidly expand their clinical pipeline.
However, many of those AI-driven “digital biotechs” still rely on community-generated data to train machine learning models, and this can become a limiting factor. While some of the leading players in the new wave, such as Recursion Pharmaceuticals and Insitro, are investing heavily in their own high-throughput lab facilities to obtain unique biology data at scale, other companies appear to be more focused on algorithms, building AI systems using data from elsewhere and having only limited in-house capabilities to run experiments.
A common practice is to use community-generated, publicly available data. But it comes with a caveat: an overwhelming majority of published data may be biased or even poorly reproducible. It also lacks standardization -- experimental conditions may differ, leading to substantial variation in data obtained by different research labs or companies.
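As a toy illustration of the standardization problem, imagine two labs reporting IC50 values for the same compounds on systematically shifted scales. A minimal sketch (the data and column names are invented) of a common mitigation -- moving to a log scale and normalizing within each lab before pooling the data for model training:

```python
# Toy sketch of harmonizing potency data reported by different labs
# before pooling it into one training set. All values are invented.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "lab":      ["A", "A", "A", "B", "B", "B"],
    "compound": ["c1", "c2", "c3", "c1", "c2", "c3"],
    "ic50_nm":  [120.0, 45.0, 900.0, 310.0, 95.0, 2100.0],  # same compounds,
})                                                           # shifted scales

# Convert IC50 in nM to pIC50, the usual log-scale representation.
raw["pic50"] = -np.log10(raw["ic50_nm"] * 1e-9)

# Z-score within each lab so that systematic between-lab offsets do not
# masquerade as biological signal once the data are pooled.
raw["pic50_z"] = raw.groupby("lab")["pic50"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)
print(raw[["lab", "compound", "pic50", "pic50_z"]])
```

Normalization of this kind can mask between-lab offsets, but it cannot recover information that was never recorded -- which is exactly why the annotation and reproducibility issues below matter.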
A lot has been written about this, and a decent summary of the topic was published in Nature: “The reproducibility crisis in the age of digital medicine”. For instance, one company reported that its in-house target validation effort stumbled over an inability to reproduce published data across several research fields. In-house results were consistent with published findings for only 20-25% of the 67 target validation projects analyzed, according to the company's report. Numerous other reports cite the poor reproducibility of experimental biomedical data.
This brings us to a known bottleneck of “industrializing drug discovery”: the need for large amounts of high-quality, highly contextualized, properly annotated biological data that are representative of the underlying biological processes and properties of cells and tissues.
For a wide-scale industrialization of drug discovery to occur, the crucial prerequisite is the emergence of widely adopted global industrial standards for data generation and validation -- and of an ecosystem of organizations “producing” vast amounts of novel data to such standards. Large drug makers and smaller companies would then be able to adopt AI technologies to a much deeper extent. Take the automotive industry as an example: a component of, say, an engine, developed in one part of the world, will often fit into a production line in another part of the world. Highly integrated processes can thus be built across geographies and companies, in a “plug-and-play” paradigm.
The same approach is required in preclinical drug discovery research: every lab experiment, every data generation process, every dataset generated must be “compatible” with all other research processes, machine learning pipelines, etc. -- across the pharmaceutical and biotech communities globally. When this tectonic shift occurs, we will witness a truly exponential change in the performance of the pharmaceutical industry -- something I would call the “commoditization” of preclinical research.
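What “compatible” could look like in practice is a shared, machine-checkable schema that every data producer validates against before publishing. Below is a minimal sketch; the field set is purely illustrative, not an existing industry standard:

```python
# Hypothetical sketch of a standardized, machine-checkable assay record.
# The field set is illustrative only, not an existing industry standard.
from dataclasses import dataclass

ALLOWED_UNITS = {"nM", "uM"}

@dataclass(frozen=True)
class AssayRecord:
    compound_smiles: str    # canonical SMILES of the tested compound
    target_uniprot_id: str  # e.g. "P00533" for EGFR
    assay_type: str         # e.g. "binding" or "functional"
    readout_value: float
    readout_unit: str
    temperature_c: float    # experimental context travels with the value
    protocol_id: str        # pointer to a versioned, published protocol

    def validate(self) -> None:
        """Reject records that other pipelines could not interpret."""
        if self.readout_unit not in ALLOWED_UNITS:
            raise ValueError(f"unsupported unit: {self.readout_unit}")
        if self.readout_value <= 0:
            raise ValueError("readout must be positive")

record = AssayRecord("CCO", "P00533", "binding", 42.0, "nM", 25.0, "PROTO-0001")
record.validate()  # every producer runs the same checks before sharing data
```

The analogy to the engine component holds: once the schema and the validation rules are shared, a dataset produced in one lab can flow into another organization's machine learning pipeline without bespoke re-curation.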
Luckily, there is a growing number of companies starting to bring about the required change in how preclinical research is done: companies that build standardized, highly automated, scalable, and increasingly compatible laboratory facilities, guided by AI-based experiment control systems and supplemented by AI-driven data mining and analytics capabilities. Such “next-gen” lab facilities are often available remotely, making preclinical experimentation more accessible to various players across a wider range of geographies.
One of the presentations at the November 2021 conference by Deep Pharma Intelligence was delivered by Dr. Martin-Immanuel Bittner, Co-founder and CEO of Arctoris, the Oxford-based biotech platform company powered by robotics and data science. One slide (below) caught my particular attention as a quite illustrative way to summarize where the drug discovery industry is heading in terms of overcoming the data quality and availability bottleneck (I have moved Insilico Medicine to the upper-right quadrant, as they recently opened their own robotized facility).

So, the industry is shifting from the “traditional” scientific paradigm at the heart of the largest and most established corporations (the bottom-left quadrant) towards AI-first R&D models. The good old “design-make-test” cycle involves numerous iterations and is mostly controlled by humans, with a lot of manual work and disconnected processes -- all creating inefficiencies and additional cost.
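In code terms, the shift the quadrant chart describes is from that human-driven loop to a closed, model-driven one. A schematic sketch follows, in which every name is a placeholder for a vendor- or platform-specific component:

```python
# Schematic sketch of a closed "design-make-test" loop. Every name here
# (model, lab, candidate objects) is a placeholder for platform-specific
# components, not any particular vendor's API.

def design(model, candidate_pool, batch_size=96):
    """DESIGN: rank untested candidates by predicted activity."""
    ranked = sorted(candidate_pool, key=model.predict, reverse=True)
    return ranked[:batch_size]

def run_closed_loop(model, candidate_pool, lab, n_rounds=10):
    """Iterate design -> make -> test, with the model closing the loop."""
    results = []
    for _ in range(n_rounds):
        batch = design(model, candidate_pool)
        plates = lab.synthesize(batch)     # MAKE: robotic synthesis
        measurements = lab.assay(plates)   # TEST: automated assays
        results.extend(measurements)
        model.retrain(results)             # learn from every round
        candidate_pool = [c for c in candidate_pool if c not in batch]
    return results
```

The quadrants below differ mainly in which of these steps a company automates: the models, the robotic make/test steps, or both.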
A number of companies at the cutting edge of AI research (upper-left quadrant) -- Atomwise, BenevolentAI, and others -- are developing advanced algorithms, model architectures, and highly connected modular systems for mining and modeling data to rapidly discover novel targets, leads, biomarkers, etc. Although such companies may have internal data generation workflows of their own, and even wet lab facilities, raw data generation is usually not a central focus, and they mostly rely on external public and private data sources to train models.
Another cohort of companies (bottom-right quadrant), such as Strateos and Emerald Cloud Lab, represents remote-controlled autonomous laboratories with cloud data infrastructures, where experiments can be run from around the globe. In contrast to the “algorithms-focused” companies, they offer a standardized and scalable way to generate massive amounts of experimental biomedical data and represent a new way to outsource lab experiments. The same quadrant is also shared by synthetic biology companies such as Ginkgo Bioworks and Zymergen, mainly due to their ability to generate massive amounts of biology data, among other things. Both types of companies in this quadrant have data generation via experiment automation and lab robotization as a cornerstone of their business model. I would call this category “Data-as-a-Service” companies.
Finally, there is the upper-right quadrant, which encompasses a small number of companies, such as Arctoris, Recursion Pharmaceuticals, Insilico Medicine, and Insitro, uniting the two worlds: biology experiment automation/robotization and cutting-edge data science and AI systems.
For example, according to Dr. Martin-Immanuel Bittner, Arctoris is capable of generating vast amounts of robust, rich, and better-contextualized data for its own research and for select R&D partners -- in a highly automated and standardized way (watch the full presentation here). This allows minimizing human intervention in the “design-make-test” cycle and, hence, dramatically streamlines and accelerates the whole research effort compared to “traditional” drug discovery. The highly annotated, quality data generated by companies like Arctoris or Recursion Pharmaceuticals can be a game-changer for the development and adoption of pharmaceutical AI by many organizations.
Democratizing biomedical science in academia
According to recent news, Carnegie Mellon University is preparing to launch the world's first university cloud lab in early July, aiming to revolutionize the way science is conducted in academic environments. The Cloud Lab will provide students and faculty with remote access to over 200 different lab instruments, enabling them to code and submit their AI-assisted experiment parameters from anywhere in the world. Trained technicians and robots at the facility will execute the experiments, eliminating the need for an on-site presence.
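To give a flavor of the model -- this is a generic, hypothetical illustration, not the Cloud Lab's actual programming interface; the endpoint, payload fields, and credentials are all invented -- a remote experiment submission might look like a structured payload sent to a lab API:

```python
# Generic, hypothetical illustration of remote experiment submission to
# a cloud lab. The endpoint, payload fields, and token are invented and
# do not describe the actual Cloud Lab programming interface.
import json
from urllib import request

experiment = {
    "protocol": "absorbance_assay",   # which validated protocol to run
    "samples": ["sample-001", "sample-002"],
    "wavelength_nm": 600,
    "replicates": 3,
    "notify": "researcher@university.edu",
}

req = request.Request(
    "https://cloudlab.example.edu/api/v1/experiments",  # placeholder URL
    data=json.dumps(experiment).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <token>",  # placeholder credentials
    },
)
with request.urlopen(req) as resp:
    print(json.load(resp))  # e.g. an experiment ID to poll for results
```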
Emerald Cloud Lab, a private research facility founded by two CMU alumni, will assist in running the university's Cloud Lab until faculty members become proficient in using the technology. The Cloud Lab concept democratizes science by providing researchers from under-resourced institutions with the opportunity to conduct experiments that would otherwise be unattainable. It also offers a more accessible research platform for individuals with disabilities.
The Cloud Lab significantly streamlines the scientific method, as scientists can review and modify experiment code from a centralized lab, accelerating progress and reducing the errors associated with replication. Additionally, the shared use of instruments maximizes the potential of expensive equipment while keeping maintenance cost-effective. Carnegie Mellon University's Cloud Lab represents a significant step towards more inclusive, efficient, and cost-effective scientific research.