
Neither AI nor ChatGPT Will Save Your Drug Discovery Pipeline …

AI struggles to generalize across molecular structures when your data is scarcer than the complexity of your problem demands

“Boring is good” is the mantra of successful engineering. This stands in contrast to modern-day biotech, which is currently at the breaking point of its third hype wave, as discussed by Bender and Cortés-Ciriano (2021). Drunk on Artificial Intelligence (AI) and ChatGPT party tricks, it has fallen in love with a dangerous assumption: that advances in deep learning, chatbots, and the creative arts must indefinitely translate into tools for scientific advancement. Fundamentally, there is nothing wrong with hype cycles governing our lives; they have always been part of human nature, as described, for example, by the Gartner hype cycle. Problems arise when dreams and visions start to detach from our fundamentally limiting human condition and physical reality.

“Data is the new oil” is one of the recent hypes built up through the echo chambers that are the hallways of industry, academia, and government. Truth be told, this hype has a strong theoretical underpinning in math and statistics (Dawid, 2020), but the devil is in the details.

“Data is not created equal.” The usefulness of data is tightly connected to the purpose for which it was collected. The hype around Big Data and Data Mining suggested that all truths of the world can be discovered in a pile of randomly collected data. The reality is that when accuracy is required — as usually is the case in science — the way data is collected is crucial to whether it can answer the question(s) at hand (Note: causal questions require a separate treatment; see Appendix). Furthermore, the higher the complexity of the problem, the more data needs to be collected. This follows from the ‘curse of dimensionality’, a well-established phenomenon in statistics: with every additional dimension considered for a problem, the amount of data required to cover that extra space grows exponentially, as the short sketch below illustrates.
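
To make the exponential growth concrete, here is a back-of-the-envelope sketch in Python (the choice of 10 sample points per axis is an illustrative assumption, not a figure from any cited study):

    # Curse of dimensionality: at a fixed resolution of 10 sample points
    # per axis, covering a d-dimensional space requires 10**d samples.
    for d in (1, 2, 3, 5, 10):
        print(f"{d:>2} dimensions -> {10**d:,} samples")
    # 1 dimension needs 10 samples; 10 dimensions already need 10 billion.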

“Choose your battles wisely” has been a core tenet of Google DeepMind’s strategy since its inception when it comes to matching data with impactful problems:

  1. Beating Atari games is easier than beating Go.
  2. Beating Go is easier than discovering new crystals (Merchant et al., 2023).
  3. Discovering new crystals is easier than solving drug discovery (Or is it? See Buonassisi, 2023).
  4. Solving drug discovery on a chemical basis is easier than solving drug discovery on a biological basis (see Table 1).

Table 1: Shows how much harder drug discovery in the biological space will be compared to protein folding exercises (Data source: Bender and Cortés-Ciriano, 2021)

It is entirely natural to choose easier battles first, divide and conquer, and refine your methods on smaller problems. As part of the aforementioned human hype cycle, though, any success on chess, Go, or protein folding is irrationally extrapolated onto ever more complex and exciting problems (translation for investors: more opportunities to gamble on the monopolies of the next decade). As an example, listen closely to the leaps of faith that drive research labs in drug discovery across the world (Jaderberg, 2024).

Unfortunately, any extrapolation will eventually lead to wrong predictions when deeper fundamental limits are violated. We are hitting the limits of our existing datasets much faster than we can collect new data. Often, data is not even mentioned anymore and is considered a ‘solved’ problem (Jaderberg, 2024).

“This time it’s different” are the four most dangerous words in investment (Sir John Templeton). They encapsulate the above fact that extrapolations eventually always become wrong and misleading. When they fail, their induced hype crashes back down onto the fundamental bedrock of human existence. In the case of biotech, the bedrock is the painful fact that all statistical modeling is fundamentally limited by the data collected, its signal-to-noise ratio, and its representational relevance to the task at hand, e.g., predicting protein folding from a sequence of letters. (Note: all arguments made in this article apply to any problem and its representation in biology, whether black-box or not; see Bronstein and Naef, 2024.)

“If data is the new oil, then what has our drilling equipment been doing so far?”

In a sea of Nobel Prize announcements, billion-dollar investment rounds, and exploding GPU prices (reminiscent of the Railway Mania discussed by Nairn, 2018), collecting data with disciplined rigor and attention to detail seems to have become an afterthought. A closer look might reveal that the scientists collecting the precious protein folding data should have had a bigger mention in the recent Nobel Prize announcements. But in a hype cycle obsessed with modeling, data is ironically considered cheap, when in fact it is the most expensive resource on planet Earth.

“The data wells are running dry, fast.” Regardless of how many NVIDIA GPUs you have bought (sometimes recursively, as discussed by Dey et al., 2023), if you run out of data, no smarter modeling or faster GPU cluster will save you, or your odyssey to discover the next billion-dollar molecule (see Werth, 1994). Don’t get me wrong: AlphaFold3 and others are impressive engineering successes, but the data oil fueling them is dwindling very fast.

“The prediction-only era is coming to an end: we need to start systematically planning how to acquire new data efficiently to support our prediction efforts. The path to better drugs leads through smarter data collection.”

“We are reaching the end of the model-first paradigm,” which has propagated damaging behavior and expectations of what counts as prestigious science. It is easier to spin up a Jupyter notebook and print matplotlib graphs (using the same old data, ironically) than to experiment in a lab. The experimenters doing the “dirty” experimental work are demoralized by how slow laboratory discovery seemingly is and how well paid Machine Learning (ML) engineers are. Naturally, wet- and dry-lab chemists would much rather take on the lifestyle of pushing a few buttons in an air-conditioned office than pipette or breathe fumes in protective gear. Humans follow incentives, as Charlie Munger famously put it: “Show me the incentive, I’ll show you the outcome.”

Figure 1: Active Learning is one of many emerging techniques adopted in Pharma and other industries to accelerate experimentation and build stronger data moats, faster.

Neither AI nor ChatGPT will save your drug discovery pipeline … So who will? 

Truly, it is the scientists, equipped with the appropriate active learning algorithms and lab automation for maximized throughput. This answer can be derived from a painful journey back to square one, recalibrating against the limits that science dictates: realizing that, fundamentally, long-term strength lies in data moats (which governments should consider investing in, as argued by Beaudry, 2024). Conceptually, data is a consequence of the most fundamental task of science: experimentation.

Experimentation is the path towards discovery; data, and the resulting insights, are simply the medium facilitating experimental progress. Arguably, that is the key reason why NVIDIA, and likewise TSMC and ASML, came to dominate their respective markets: they embraced experimentation as the core philosophy of their engineering; see Jensen Huang’s sermon to Recursion. (Note: It is a sermon because he uses religious language, e.g., at minute 22:30: “there is always that leap of faith, that is necessary […] someone has got to do it.”) The leaders of scientifically focused industries in the next decade — whether pharma, materials, or finance — will be distinguished by their investments in experimentation. Equipping their talent pools with the best experimentation tools and training to execute their scientific tasks will enable them to build their individual data moats. In the casino of science, smarter gambles provably lead to the best chances of success.

ML delivers smarter gambles: Active Learning enables clever, maximally informative exploration of whatever complex representation space is at hand.
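
For intuition, here is a minimal pool-based active learning loop using uncertainty sampling; the toy objective, the candidate pool, and the random-forest surrogate are all illustrative assumptions, not a description of any specific pharma pipeline:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    def run_experiment(x):
        """Stand-in for a slow, expensive wet-lab measurement."""
        return np.sin(3 * x[0]) + 0.1 * rng.normal()

    # Candidate pool: conditions we could test (e.g., reaction parameters).
    pool = rng.uniform(0, 2, size=(500, 1))

    # Seed the loop with a handful of random experiments.
    seed = rng.choice(len(pool), size=5, replace=False)
    X, y = pool[seed], np.array([run_experiment(x) for x in pool[seed]])

    for _ in range(20):  # experimental budget: 20 further runs
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
        # Uncertainty = disagreement across trees; query the most uncertain point.
        per_tree = np.stack([tree.predict(pool) for tree in model.estimators_])
        candidate = pool[per_tree.std(axis=0).argmax()]
        X = np.vstack([X, candidate])
        y = np.append(y, run_experiment(candidate))

The key design choice is the query rule: instead of spending the budget on a fixed grid or random draws, each new experiment is placed where the current model is least certain, which is what lets the loop cover a complex space with far fewer runs.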

For example, Roche has to decide whether to clone its experimentation pipeline to double its throughput, at a cost of several million USD. Instead, it invests in smarter exploration and exploitation algorithms to maximize the information it can gather (Sin et al., 2024); see the acquisition-rule sketch below. Merck KGaA is building its own internal experimentation software to maximally exploit the data its labs can deliver. Yoshua Bengio is a big fan of Active Learning.

Lab Automation delivers experiments, more often: Closed-loop optimization is freeing scientists to focus their precious and expensive time on their true passion: the science. Bets are already being made. Canada has made a CAD 160 million bet on lab automation for materials, chemistry, and beyond.
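
Here is what such an exploration-exploitation trade-off can look like in code: a minimal upper-confidence-bound (UCB) acquisition rule over a Gaussian-process surrogate. The function name and the kappa value are illustrative assumptions, not taken from the cited Roche or Merck KGaA work:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def next_experiment(X_tried, y_tried, candidates, kappa=2.0):
        """Pick the next condition to test via an upper confidence bound."""
        gp = GaussianProcessRegressor(normalize_y=True).fit(X_tried, y_tried)
        mean, std = gp.predict(candidates, return_std=True)
        # mean rewards exploitation; std rewards exploring unknown regions.
        return candidates[np.argmax(mean + kappa * std)]

Raising kappa biases the loop towards exploration; kappa = 0 recovers pure exploitation of the current best guess.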

Hype Cycles Are a Choice

Truth be told, again, hype cycles are a choice. And the reality is that precious talent and resources are wasted if recalibration towards more fundamental limits like data collection starts too late. Don’t get me wrong! I am not saying experimentation will solve all your problems. I am saying you can’t build a house without experimentation as its foundation. Houses with shoddy foundations tend not to last long. Modeling and computation are, of course, also important ingredients in the pursuit of prediction-driven science. They are simply second in line, strategically and technically, and shaped by the experimental foundation.

Algorithms at the heart of Machine Learning-driven experimentation

At Matterhorn Studio, we build algorithms for drug-discovery experimentation in labs across the world that help scientists deliver pivotal results. On the 2nd of December 2024, we had 180 attendees at our inaugural symposium on “Active Learning in Pharma,” discussing challenges, adoption, and next steps. Our next symposium is scheduled for November 2025 – sign up now and watch 5-minute video summaries from active learning teams at Merck KGaA, Novo Nordisk, MSD, Evotec, and more.

Appendix

Disclaimer: I did not argue against modeling or expanding compute, nor against deep learning or building supercomputer clusters. I am simply observing the limits we are reaching along these dimensions of ‘data science’: most of the low-hanging fruit is gone, and the fundamental limit of data is becoming more obvious. I am arguing for a rebalancing and recalibration, a return to deeper fundamentals. Modeling and compute will continue to be key factors, but we need to make sure our data collection from experimentation is equally invested in. I am also aware that drug discovery is often seen as the first and most crucial stage. For Pharma, saving costs to get drugs to market faster is the key challenge. Recent hype suggests that this goal should be tackled in the first stage. I argue that faster drug discovery and development are both dependent on disciplined experimentation. The value chain after the initial hit finding has plenty of potential for improvement that is often considered boring. But boring is good.

Causality, or not: Another question investigators need to answer is whether the problem at hand is causal in nature. This concept of ‘causality’ is an emerging paradigm with implications worth another article, but put simply: most statistical modeling, by definition, does not allow for causal conclusions. Unless explicitly built to do so, no deep learning model can distinguish, for example, whether more ice cream is sold in the summer because it is hot, or whether it is hot because people buy more ice cream; the short simulation below makes this concrete. Causal Inference is a statistical language developed to deal with these causal questions and is finding rapid adoption across many experimental and non-experimental fields.
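
In this sketch (the numbers are illustrative assumptions), temperature causes ice cream sales by construction, yet regressing either variable on the other yields exactly the same goodness of fit, so predictive performance alone cannot reveal the causal direction:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    temperature = rng.uniform(10, 35, size=1000)             # the true cause
    sales = 20 * temperature + rng.normal(0, 50, size=1000)  # the true effect

    forward = LinearRegression().fit(temperature.reshape(-1, 1), sales)
    reverse = LinearRegression().fit(sales.reshape(-1, 1), temperature)

    # Both directions produce the identical R^2 (the squared correlation),
    # so a purely predictive model cannot tell cause from effect.
    print(forward.score(temperature.reshape(-1, 1), sales))
    print(reverse.score(sales.reshape(-1, 1), temperature))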

References

Werth B, The Billion Dollar Molecule (1994)

Bender A, Cortés-Ciriano I, Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: Ways to make an impact, and why we are not there yet, Drug Discovery Today, Volume 26, Issue 2, 512-524 (2021). https://doi.org/10.1016/j.drudis.2020.12.009. There is also an accompanying YouTube talk available.

Beaudry R, Data is Infrastructure (2024)

Buonassisi T, On Characterization of “Novel Materials” from High-Throughput & Self-Driving Labs (2023)

Bronstein M and Naef L, The Road to Biology 2.0 Will Pass Through Black-Box Data (2024)

Dey A, Heese J, Srinivasan S, and Lobb A. Ginkgo Bioworks vs. Scorpion Capital: The Debate Over Related-Party Revenues (2022, revised May 2023) Harvard Business School Case 123-037

Jaderberg M, How AI is saving billions of years of human research time, TEDAI Talk San Francisco, (2024) Comment: Unfortunately, data is only discussed once, and its scarcity reasoned away with “We’re always creating new ways to record and measure every detail of our real messy world, that then creates even bigger datasets, that helps us then train even richer models.” It’s reminiscent of the Big Data Mining paradigm where purposeful collection of low noise data is seen as a solved bottleneck.

Merchant, A., Batzner, S., Schoenholz, S.S. et al. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023). https://doi.org/10.1038/s41586-023-06735-9

Nairn A, Engines That Move Markets: Technology Investing from Railroads to the Internet and Beyond (2018)

Sin JW, Chau SL, Burwood RP, Püntener K, Bigler R, and Schwaller P. Highly Parallel Optimisation of Nickel-Catalysed Suzuki Reactions through Automation and Machine Intelligence. ChemRxiv. 2024; doi:10.26434/chemrxiv-2024-m12s4 This content is a preprint and has not been peer-reviewed.

Dawid P, Decision-theoretic foundations for statistical causality (2020) https://arxiv.org/abs/2004.12493

Jakob Zeitler – enlightenbio guest blogger.

Jakob Zeitler, PhD, is a Pioneer Fellow at the University of Oxford. He actively researches Machine Learning and Causal Inference, their practical benefits, and their limitations. As a DeepMind Scholar during his PhD at University College London, he interned with Spotify Research UK, yielding two patent filings. At Matterhorn Studio, Jakob leads the research of algorithms at the heart of experimentation across Pharma, Chemistry, and Finance. Learn more about Active Learning and its impact in Pharma.


