Navigating Data Scarcity: AI's Emerging Role in Biotech
ALSO: Big pharma hints that AI is already impacting clinical research; Medidata's Fareed Melhem Discusses AI's Impact on Clinical Trials; ethical and social challenges of AI in drug development
Hi! I am Andrii Buvailo, and this is my weekly newsletter, ‘Where Tech Meets Bio,’ where I talk about technologies, breakthroughs, and great companies moving the industry forward.
Now, let’s get to this week’s topics!
Navigating Data Scarcity: AI's Emerging Role in Biotech
‘Garbage in, garbage out’ is a well-known principle in the machine learning (ML) community, and it is certainly true when it comes to adopting ML-based methods in biotech and drug discovery.
According to a recent McKinsey report, ‘lack of high-quality data sources and data integration’ was named as one of the three key factors slowing down digitalization and data analytics in life sciences (the other two being a lack of cross-disciplinary talent and a lack of tech adoption at scale).
In my own small LinkedIn poll, 52% of respondents picked ‘lack of domain-specific data’ as the biggest challenge facing AI adoption in the biotech industry (a good share of the respondents appear to be subject matter experts, based on my brief review of their stated roles).
Tackling the problem of data scarcity
San Francisco-based ‘techbio’ company Atomic AI developed a tool to tackle the lack of data about RNA structures.
Atomic AI’s proprietary AI-driven 3D RNA structure engine, known as PARSE, generates RNA structural datasets, integrating machine learning foundation models with large-scale, in-house experimental wet-lab biology to unveil functional binders to RNA targets.
According to the company, its technology can predict structured, ligandable RNA motifs at unprecedented speed and accuracy, addressing a key barrier in current approaches to RNA drug discovery.
Atomic AI plans to use its database of discovered and designed 3D RNA structures to develop a pipeline of rationally designed small-molecule drug candidates.
What is interesting is that Atomic AI uses so-called geometric deep learning, which can learn from very small RNA datasets.
Geometric deep learning is a subfield of machine learning that generalizes traditional neural network methodologies to data on non-Euclidean domains, such as graphs, manifolds, and complex networks. It seeks to understand data through its inherent geometric structures and relationships.
The method, called the Atomic Rotationally Equivariant Scorer (ARES), surpasses existing techniques in performance—even with training on just 18 known RNA structures. ARES's capacity to learn from minimal data addresses a significant challenge faced by typical deep neural networks. With its reliance solely on atomic coordinates and no RNA-specific details, this method has potential applications in various fields including structural biology, chemistry, and materials science, among others.
According to this Science paper, ARES operates without any predetermined ideas regarding the essential features of a structural model's accuracy. It doesn't come with any inherent understanding of double helices, base pairs, nucleotides, or hydrogen bonds. ARES's methodology isn't exclusive to RNA; it can be applied to any molecular system.
Instead of pre-defined specifications, the initial stages of the ARES network are tailored to detect structural patterns, learning their identities during training. Every layer calculates various characteristics for each atom, considering the spatial arrangement of adjacent atoms and the outcomes from the preceding layer. The only inputs for the initial layer are the 3D coordinates and the chemical element classification of every atom.
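To make the layer description above more concrete, here is a minimal sketch in Python. It is not ARES itself: ARES uses rotationally equivariant layers, while this toy version relies on a simpler trick, interatomic distances, which do not change when a structure is rotated or moved. The function names (`atom_layer`, `embed`, `score`), the random weights, the cutoff, and the six-atom "structure" are all assumptions made up for illustration. What it does share with the description is the input (only 3D coordinates and element types) and the idea that each layer updates every atom's features from its neighbors and the previous layer's output.

```python
import numpy as np

def one_hot_elements(elements, vocab=("C", "N", "O", "P")):
    """Encode each atom's chemical element as a one-hot vector."""
    return np.array([[1.0 if e == v else 0.0 for v in vocab] for e in elements])

def atom_layer(coords, feats, weights, cutoff=5.0):
    """One toy layer: update each atom's features from neighbors within `cutoff`."""
    # Pairwise distances depend only on relative positions, so they are
    # unchanged when the whole structure is rotated or translated.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Closer neighbors contribute more; the atom itself and far atoms contribute nothing.
    neighbor_weight = np.exp(-dists) * ((dists > 0) & (dists < cutoff))
    aggregated = neighbor_weight @ feats
    # Combine each atom's previous features with the aggregated neighbor features.
    return np.tanh(np.concatenate([feats, aggregated], axis=1) @ weights)

rng = np.random.default_rng(0)
elements = ["C", "N", "O", "P", "C", "O"]       # made-up six-atom "structure"
coords = rng.normal(scale=2.0, size=(6, 3))     # made-up 3D coordinates
feats0 = one_hot_elements(elements)             # shape (6, 4)
w1 = rng.normal(size=(8, 8))                    # untrained, random weights
w2 = rng.normal(size=(16, 8))

def embed(xyz):
    """Two stacked layers: per-atom features built only from geometry and elements."""
    hidden = atom_layer(xyz, feats0, w1)
    return atom_layer(xyz, hidden, w2)

def score(xyz):
    """Pool per-atom features into a single number for the whole structure."""
    return float(embed(xyz).mean())

# Rotate and shift the structure; the score should not change.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1  # flip one axis so q is a proper rotation, not a reflection
rotated = coords @ q.T + np.array([1.0, -2.0, 0.5])

print(score(coords), score(rotated))
```

The two printed numbers should agree up to floating-point error, which is the point: a network built this way does not need to see the same molecule in many orientations, and that is one reason such models can get by with few training examples.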
Zero-shot Learning
Another interesting example of tackling the data problem in biology was demonstrated by the US-based Absci, which focused on designing antibodies using AI.
Absci reports a milestone in generative AI for drug development: it claims to be the first to design and validate therapeutic antibodies using zero-shot machine learning.
What's zero-shot?
It's a machine learning approach where a model is trained on certain categories of data and is then able to make predictions or classifications on entirely new, unseen categories, often leveraging the relationships between known and unknown categories. For example, if trained on images of horses, the model might be able to recognize zebras, even if it hasn't been explicitly trained on zebra images.
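The horse-and-zebra idea can be sketched in a few lines of Python. This is a toy illustration of attribute-based zero-shot classification, not Absci's method. Everything in it is an assumption made for the example: the two hand-written attributes, the `MIXING` matrix standing in for an image encoder, the noise level, and the extra "tiger" and "cat" classes. The model is trained only on horses, tigers, and cats, and can still label zebras because the zebra class is described in terms of attributes it already learned.

```python
import numpy as np

# Each class is described by two hand-written attributes: [horse_shaped, striped].
CLASS_ATTRIBUTES = {
    "horse": np.array([1.0, 0.0]),
    "tiger": np.array([0.0, 1.0]),
    "cat":   np.array([0.0, 0.0]),
    "zebra": np.array([1.0, 1.0]),  # described here, but never seen in training
}
SEEN_CLASSES = ["horse", "tiger", "cat"]

# Hypothetical stand-in for an image encoder: attributes are mixed
# linearly into 5-dimensional "image features", plus a little noise.
MIXING = np.array([[1.0, 0.0, 0.5, -0.5, 1.0],
                   [0.0, 1.0, 0.5, 0.5, -1.0]])
rng = np.random.default_rng(0)

def sample_features(label, n):
    """Generate n noisy feature vectors for one class."""
    clean = CLASS_ATTRIBUTES[label] @ MIXING
    return clean + rng.normal(scale=0.05, size=(n, MIXING.shape[1]))

# Train on seen classes only: learn a linear map from features to attributes.
train_x = np.vstack([sample_features(c, 20) for c in SEEN_CLASSES])
train_a = np.vstack([np.tile(CLASS_ATTRIBUTES[c], (20, 1)) for c in SEEN_CLASSES])
feature_to_attr, *_ = np.linalg.lstsq(train_x, train_a, rcond=None)

def classify(features):
    """Predict attributes, then pick the class whose description is closest."""
    predicted = features @ feature_to_attr
    return min(CLASS_ATTRIBUTES,
               key=lambda c: np.linalg.norm(predicted - CLASS_ATTRIBUTES[c]))

# Classify zebra examples even though no zebra was in the training set.
predictions = [classify(x) for x in sample_features("zebra", 10)]
print(predictions)
```

The key design choice is that the model predicts attributes rather than class labels directly, so a new class only needs a description, not training examples.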
In Absci’s case, antibodies are designed to latch onto certain targets without any prior training data from known antibodies for those targets.
Why is this significant? The zero-shot model by Absci produces antibody configurations distinct from existing antibody databases, encompassing de novo versions of all three heavy chain CDRs (HCDR123), the antibody regions most critical to target binding.
How efficient is this approach? In tests against over 100,000 antibodies, Absci’s success rate proved to be between five and 30 times higher than established biological benchmarks.
Synthetic data
A quite innovative concept is the application of synthetic data to close the data gaps in those areas where real data is scarce. What is synthetic data?
Synthetic data is information that's artificially manufactured rather than generated by real-world events, but it has a probability distribution similar to that of the real data. It can therefore be used to train machine learning models in much the same way as real data.
For instance, there is promising evidence that state-of-the-art synthetic data models can produce artificial versions of even highly dimensional and complex genomic and phenotypic data.
Researchers from Gretel.ai, in collaboration with Illumina’s Emerging Solutions, are investigating the possibility of generating synthetic versions of real-world genomic datasets. The synthetic data crafted by Gretel preserves the structure of the original dataset while ensuring increased privacy, allowing researchers open access without jeopardizing patient confidentiality. Initial studies on a sample of 1,220 mice have shown promising results, suggesting that synthetic data can potentially revolutionize data sharing in genomics. Gretel and its collaborators aim to further refine the scalability, accuracy, and privacy of synthetic genomics data in the future.
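To illustrate the basic idea of synthetic data in code, here is a deliberately simple Python sketch. It is not Gretel's approach: genomic data calls for far richer generative models and dedicated privacy checks. The "real" dataset below is itself made up, and the generator is just a multivariate Gaussian fitted to it. The point is only to show what "a similar probability distribution" means in practice: new records are sampled from a fitted model and are not copies of any original row.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a "real" dataset: 500 records, three correlated measurements.
# The means and covariances are invented for this example.
true_mean = np.array([70.0, 120.0, 5.5])
true_cov = np.array([[25.0, 15.0, 1.0],
                     [15.0, 100.0, 2.0],
                     [1.0, 2.0, 0.25]])
real = rng.multivariate_normal(true_mean, true_cov, size=500)

# "Train" the generator: estimate the mean and covariance of the real data.
est_mean = real.mean(axis=0)
est_cov = np.cov(real, rowvar=False)

# Generate synthetic records by sampling from the fitted distribution.
synthetic = rng.multivariate_normal(est_mean, est_cov, size=500)

# Compare summary statistics of the real and synthetic datasets.
print("real mean:     ", real.mean(axis=0).round(2))
print("synthetic mean:", synthetic.mean(axis=0).round(2))
print("real correlations:\n", np.corrcoef(real, rowvar=False).round(2))
print("synthetic correlations:\n", np.corrcoef(synthetic, rowvar=False).round(2))
```

The means and correlations of the two datasets should come out close to each other. Note that matching summary statistics does not by itself guarantee privacy, which is why privacy is evaluated separately in work like the project described above.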
From Data to Treatment: Medidata's Fareed Melhem Discusses AI's Impact on Clinical Trials
As the potential of artificial intelligence (AI) continues to unfold, how is it revolutionizing the intricate landscape of clinical trials? How can vast reservoirs of data be harnessed to enhance patient experiences and accelerate drug discovery? Navigating the maze of global regulations presents its own challenges, but can these be surmounted with technological innovation? And in an era that emphasizes the importance of diversity and inclusion, how can we ensure that clinical trials truly reflect the diverse populations they serve?
To answer these and other questions about the growing role of AI in clinical trials, I talked to Fareed Melhem, Senior VP and Head of AI at Medidata, a Dassault Systèmes company.
Medidata, a provider of clinical trial solutions to the life sciences industry, recently announced a multi-year partnership expansion with Catalyst Clinical Research to support their global oncology brand, Catalyst Oncology. Through this extended partnership, Catalyst can continue to support over 150 oncology studies and manage more than 80 next-generation cancer clinical trials today across Phases I–III. Notably, around 90% of all oncology FDA approvals last year were developed using Medidata software.
Andrii: Fareed, the integration of AI into clinical trial processes is definitely making strides these days. Could you elaborate on how Medidata AI is utilizing artificial intelligence to further enhance the precision, efficiency, and success rates of clinical trials?
Fareed: At Medidata, we’re leveraging the power of AI to analyze the industry’s largest data set—so far, consisting of 30,000 clinical trials and more than nine million study participants—and to assist with the collection, analysis, and operational services for clinical trials. By better understanding and leveraging data, we can ultimately create detailed external control arms to allow for higher patient recruitment and retention, facilitate better study planning, select the most accurate sites and patients, and ultimately lead to better, faster, and safer trials.
The combination of Medidata’s technological solutions helps us improve patients’ experiences in clinical trials by reducing the number of patients who receive an outdated standard of care, better detecting adverse events, and bringing treatments to market—and more importantly, to patients—earlier.
Andrii: Given your extensive experience with international clients, how does Medidata adapt its technology and services to meet the diverse needs and regulations of global markets? Are there unique challenges or opportunities that you've encountered in different regions?
Fareed: What sets Medidata apart from other clinical trial providers is the immense breadth and depth of our data set and the cutting-edge technology that our customers can utilize to execute their studies. We’ve standardized this clinical and operational data so that biopharmaceutical, medical device companies, and contract research organizations across the world can use Medidata AI to derive relevant insights that inform key decision-making throughout their clinical trial program. This unique and robust dataset, combined with our over 20 years of industry experience and expertise, helps us to support the needs of our clients worldwide and address and comply with the different regulatory guidelines around the world.
Andrii: Real-time data access, scalability, and the capability for mid-study changes are all important. How is Medidata staying ahead in understanding and anticipating the evolving needs of the industry and ensuring its solutions are not just current but future-ready?
Fareed: Our data set is the foundation for all of our work and is critical to addressing the needs of our customers as they arise. A prime example of this in action is how we use our product, Medidata AI Intelligent Trials, throughout the clinical trial process, from study planning to execution.
Whether it is determining realistic timelines, establishing diversity goals and enrollment sites, or identifying opportunities to improve operational performance, our customers are seeing the benefit of our services. In fact, our top 10 pharma clients experienced a 6+ month acceleration of their clinical trials in a hyper-competitive indication due to our products.
Being quick to understand and address the needs and common pain points of the clinical trial industry is crucial to continuing to be the leader of this complex and evolving landscape. In doing so, we hope to help customers get their treatments to patients faster and with fewer roadblocks.
Andrii: Beyond the technology itself, Medidata's solutions significantly affect the pace and success of drug development, ultimately benefiting patients worldwide. Can you share a particular success story or impact narrative that resonates with you, where Medidata's technology made a notable difference?
Fareed: A great example of the power of Medidata AI is our collaboration with Every Cure, a nonprofit dedicated to drug repurposing. We are working with Every Cure to use the power of Medidata AI to unlock new uses for existing medicines across all disease areas. Earlier this year, we were able to identify the most promising treatment for an individual living with idiopathic multicentric Castleman disease (iMCD), a rare, life-threatening condition.
Through this AI-guided discovery, the patient—who had exhausted all existing treatment options and was preparing for hospice care—was able to be successfully treated, providing hope to the individual and their loved ones, but also others living with the condition. This success story would not have been possible without the power of AI and reinforces our mission at Medidata, which is ultimately to power smarter treatments and healthier people.
Andrii: Can you share insight into any upcoming technological advancements or innovations within Medidata AI that will continue to enhance and expedite the clinical trial process? Maybe there are some major plans for 2024 you can talk about?
Fareed: A significant area where we are continuing to invest and innovate as a company is helping our customers to ensure greater diversity, equity, and inclusion within clinical trials. Earlier this year, we launched the Medidata Diversity Program, the industry’s most comprehensive solution to this historically prevalent challenge.
AI in particular has been very effective in helping us accomplish this goal. Medidata AI can promote greater inclusivity in trials by providing customers with baseline and benchmark data and actionable insights so they can integrate a more diverse patient population into their clinical trial program. It can also improve diversity in patient recruitment by identifying and screening potential participants and reducing biases found in manual recruitment processes.
For example, we collaborated with a large sponsor to benchmark their study’s diversity against that of industry performance. Our data and analytics showed the sponsor that they had a significant gap in their patient demographic make-up compared to the industry. We were then able to identify specific areas and means for improvement in order to increase diversity and ultimately increase the understanding of the therapy’s effectiveness in a larger population.
Big Pharma hints that AI is already impacting clinical research
Speaking more broadly about the growing role of AI in clinical trials, there is some evidence that it is already providing value. According to a recent Reuters report, Amgen's AI tool, ATOMIC, now scans vast amounts of data to rank clinics and doctors based on recruitment history, cutting enrollment time for some mid-stage trials by half. By utilizing ATOMIC, Amgen aims to shorten the typical drug development timeline by two years by 2030.
Novartis also leverages AI to expedite patient enrollment in trials, making the process faster, cheaper, and more efficient. However, AI is only as good as the data it is trained on. With only about 25% of healthcare data available globally for research purposes, there are still limitations.
Bayer utilized AI to decrease participant numbers in a late-stage trial for asundexian. Specifically, it used AI to bridge mid-stage trial findings with extensive real-world data from millions of patients across Finland and the US, facilitating the forecasting of long-term risks among a population analogous to the trial. Bayer plans to use real-world data for an external control arm in a pediatric study of the same drug.
According to the Reuters report, Blythe Adamson, PhD, MPH, a senior principal scientist at Roche subsidiary Flatiron Health, emphasized how AI enables rapid, large-scale analysis of real-world patient data. With traditional methods, analyzing data from 5,000 patients could take months, whereas data from millions of patients can now be analyzed in just a few days.
Smaller companies are also applying AI to boost clinical research
Recently, I wrote about how Lantern Pharma Inc. (Nasdaq: LTRN) is leveraging its AI-driven platform RADR®, now available for pilots and collaboration programs, to build a portfolio of repurposed oncology assets now in Phases 1-2.
In one case, RADR® predicted response to Elraglusib with 88% accuracy across all solid tumors being tested; using those models, the company was also able to identify sub-populations of melanoma patients that may benefit from Elraglusib.
Another AI platform, the inClinico system by Insilico Medicine, was validated through three methods in a study evaluating how accurately AI can predict Phase II trial success. This transformer-based platform, utilizing generative AI and multimodal data, was trained on 55,600 unique Phase II trials over 7 years. Insilico's model showed 79% accuracy in predicting real-world trial outcomes in the prospective validation set.
Achieving Data Diversity Through AI in Drug Development
I have stumbled upon an insightful article from BioSpace on the ethical and social challenges brought forth by the implementation of AI in drug discovery.
Here is a summary of key takeaways worth your attention:
1. Scaling Up Drug Discovery: Suchi Saria from Johns Hopkins University highlights AI’s potential to scale up the search for drug targets, emphasizing the need for more open and inclusive clinical trials that better reflect the diversity of real-world populations.
2. Enhancing Fairness and Equity: Kim Branson from GSK points out AI’s role in identifying who might respond best to specific medications, increasing fairness and equity in healthcare. This is particularly beneficial for the socioeconomically disadvantaged, with 49% reporting insurance refusals for drug coverage.
3. Reducing Drug Discovery Costs: The use of AI in drug discovery could significantly cut down the staggering costs associated with bringing a new drug to market, estimated at $2.3 billion. This reduction could potentially lead to more affordable drug pricing, as explained by Daphne Koller from Insitro.
4. Addressing Privacy Concerns: Despite legal protections like HIPAA and GDPR, patient privacy remains a major concern with healthcare data breaches on the rise. Ensuring robust anonymization of data and strengthening the consent system are vital steps toward safeguarding patient information.
5. Ensuring Representative Data: To avoid biased outcomes in drug development, it is crucial to have diverse and representative data sets. Abdoul Jalil Djiberou Mahamadou from Stanford University highlights the over-representation of certain demographic groups in clinical trials, calling for a more inclusive approach in data collection, especially from low- and middle-income countries.
6. Fostering Ethical Practices: Engaging clinicians, ethicists, and other stakeholders in the data collection and drug development process is essential to upholding ethical standards, especially in regions with no existing legislation on privacy and ethics.
7. Committing to Global Data Diversity: Efforts are underway by databanks and pharmaceutical companies to diversify patient data and ensure global representation. Initiatives like ‘Our Future Health’ by the UK Biobank and the ‘All of Us’ project in the U.S. are actively working towards this goal.






