enlightenbio Blog

Cloud for AI/ML & Modern Data Science

The Bio-IT World Conference & Expo, taking place in Boston, April 2-4, has grown into a highly regarded, must-attend conference for life science, clinical care, and IT professionals. The program focuses on new trends in data generation, knowledge management, and information technology in life sciences and drug development, with many interesting and exciting tracks one would not want to miss.

I am excited to have been offered the opportunity to chair the “Cloud for AI/ML & Modern Data Science” session on Day 2 of the conference. This topic is not only exciting but also highly relevant: in recent years, more and more companies have introduced AI into their workstreams and products, and the innovations and future promise of AI have already led to breakthroughs across the healthcare industry, including in drug discovery, drug design, and clinical development, as well as in laboratory and business process management.

AI Is Driving Value When It Comes to Complex Scientific and Statistical Problems

AI has demonstrated again and again that it can help create things faster, more efficiently, and in many instances even better. It has also been shown that AI can be applied to a process in an iterative cycle to continuously improve it. In the healthcare industry, there are many applications where data is generated at scale and at sufficient quality for AI to assist in improving these processes. AI’s impact is most notable in tackling scientific and statistical challenges beyond human capacity. For instance, AI excels at predicting intricate protein structures and functions, and at accurately delineating the roles of diverse therapeutics. Its integration spans the entire therapeutic value chain, extending beyond drug discovery to encompass all facets of scientific exploration and pharmaceutical advancement.

Image source: How AI Accelerates Drug Development by Eric Fish

The Next Wave of AI’s Impact Is on Understanding the Interactions between Different Classes of Molecules

Last year’s Nobel Prize in Chemistry was awarded to Demis Hassabis and John Jumper of Google DeepMind, for predicting proteins’ complex structures from their amino acid sequences, and to David Baker of the University of Washington, for computational protein design. This achievement marks just the beginning of a new era in scientific advancements. Looking ahead, the field is poised for further breakthroughs as researchers delve into other molecular modalities like RNA and DNA molecules, as well as residue modifications such as glycans and covalent ligands. The key focus now shifts towards comprehending the intricate interactions among these diverse classes of molecules. By unraveling these molecular interactions, scientists aim to enhance their understanding of biological reactions and processes. This deeper insight will pave the way for more effective drug development strategies, enabling targeted interventions with a high level of predictability in their outcomes.

Drew Dresser, Director, Cloud Engineering, Flagship Pioneering: “AI methods like AlphaFold, DiffDock, Boltz-1, ESMFold, RFDiffusion are revolutionizing biological design. We’re moving beyond simple predictive modeling into truly generative approaches – enabling us to design therapeutic molecules, proteins, antibodies, and even entire cell therapies from scratch with unprecedented accuracy. This coupled with cloud native bioinformatics and data lake architectures will help organizations seamlessly integrate disparate datasets and reduce time-to-insight.”

Large Language Models

ML and AI have a long history and considerable maturity, particularly when compared to the relatively new Large Language Models (LLMs). Rising to prominence in the last few years, LLMs stand out in tasks related to Natural Language Processing (NLP). They excel in text generation, language translation, summarization, chatbot operation, and conversational engagement, among other applications. In essence, LLMs fall under the category of Generative AI (GenAI), specifically focusing on creating human-like text.

Initially tailored for NLP functions, LLMs have found a significant niche in modern biology, where sequential data plays a vital role. This is evident in fields like bioinformatics, where LLMs, such as protein language models (pLMs), are specifically trained to extract insights from protein sequences. These models prove instrumental in interpreting sequential data patterns, simplifying user interactions by enabling information retrieval without the need for extensive programming skills, especially in large-scale scenarios.
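To make the idea of learning from sequence statistics concrete, here is a deliberately tiny, self-contained sketch that uses a character-bigram model over amino acids as a toy stand-in for a protein language model. The corpus and sequences below are invented for illustration; a real pLM is a transformer trained on millions of sequences and captures far richer patterns than pairwise residue frequencies.

```python
from collections import defaultdict
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_bigram_model(sequences):
    """Count residue-pair frequencies across a training corpus - a toy
    stand-in for the statistics a transformer-based pLM learns."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    # Convert counts to log-probabilities with add-one smoothing.
    model = {}
    for a in AMINO_ACIDS:
        total = sum(counts[a].values()) + len(AMINO_ACIDS)
        model[a] = {b: math.log((counts[a][b] + 1) / total)
                    for b in AMINO_ACIDS}
    return model

def log_likelihood(model, seq):
    """Score a sequence; higher means 'more like the training corpus'."""
    return sum(model[a][b] for a, b in zip(seq, seq[1:]))

# Invented mini-corpus of protein fragments.
corpus = ["MKTAYIAKQR", "MKVLATAYIA", "MKTAYQRQRA"]
model = train_bigram_model(corpus)
# A fragment resembling the corpus scores higher than a poly-W run.
print(log_likelihood(model, "MKTAYIA") > log_likelihood(model, "WWWWWWW"))
```

The same principle, scaled up enormously, is what lets models like ESM infer structural and functional signals directly from sequence data.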

Among the diverse range of LLMs, ChatGPT emerges as a prominent player. The launch of ChatGPT in November 2022 marked a significant milestone, propelling LLMs into the spotlight and sparking a surge of interest and investment in generative AI technology. The versatility and adaptability of LLMs, along with their user-friendly interface, indicate a promising future, solidifying their presence in various industries.

Agentic AI Systems

Agentic AI systems are another layer of abstraction beyond LLMs, enabling automation and complex task reasoning. In the realm of clinical trials, a domain fraught with intricate processes and administrative challenges, Agentic AI offers a transformative solution. By leveraging Agentic AI on top of LLMs, the system can tackle critical issues such as defining inclusion and exclusion criteria, patient enrollment, participant screening, the definition of study procedures and data collection, real-time analytics for cost and time efficiency, and expedited responses for targeted patient care. This integration paves the way for streamlined and efficient trial management, enhancing the overall quality of patient care.

Agentic AI stands out for its ability to operate autonomously, tackle challenges, adjust to evolving circumstances, and assimilate insights from its interactions. Empowered with NLP and diverse AI functionalities, Agentic AI functions on behalf of users, comprehending information sourced from databases, sensors, and interfaces. Beyond making decisions rooted in its comprehension, it evolves and refines itself through continuous learning from feedback and encounters.
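The perceive-decide-act-learn loop described above can be sketched in miniature. Everything in this example is invented for illustration: the patient fields, inclusion criteria, and threshold update are not drawn from any real trial system, and a production agent would sit on top of an LLM and curated clinical data rather than two hard-coded rules.

```python
class ScreeningAgent:
    """Toy agent illustrating the perceive-decide-act-learn loop."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def perceive(self, record):
        # Score how well a patient record matches (invented) inclusion criteria.
        criteria = {"age_ok": record["age"] >= 18,
                    "biomarker": record["biomarker"] > 1.0}
        return sum(criteria.values()) / len(criteria)

    def decide(self, score):
        return "enroll" if score >= self.threshold else "exclude"

    def act(self, record):
        return self.decide(self.perceive(record))

    def learn(self, decision, correct):
        # Tighten or relax the threshold based on reviewer feedback.
        if decision == "enroll" and not correct:
            self.threshold = min(1.0, self.threshold + 0.1)
        elif decision == "exclude" and not correct:
            self.threshold = max(0.0, self.threshold - 0.1)

agent = ScreeningAgent()
patient = {"age": 45, "biomarker": 2.3}
decision = agent.act(patient)       # both criteria met -> "enroll"
agent.learn(decision, correct=True) # feedback closes the loop
print(decision)
```

The learning step is the part that distinguishes an agent from a static classifier: its behavior shifts as feedback accumulates.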

In the realm of healthcare, Agentic AI plays a pivotal role in early disease detection through the analysis of patient data.

Multimodal AI

In 2024, we transitioned into a multimodal era where AI models can now process and integrate data from various modalities. This advancement allows for the integration of different types of data, including text, audio, image, sensory, and video data, leading to a more comprehensive understanding of information. The next frontier involves integrating diverse data to facilitate generative and integrative engineering for predicting and measuring the performance of medical practices using multimodal AI.
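One common integration pattern is late fusion, where each modality is encoded separately and the resulting features are combined into a single joint representation. The sketch below shows the idea with invented placeholder vectors standing in for real encoder outputs; normalizing each modality first keeps one with large raw values from dominating the fused vector.

```python
import math

def normalize(vec):
    """Scale a feature vector to unit length so no single modality
    dominates the fused representation."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def late_fusion(**modalities):
    """Concatenate per-modality features into one joint vector."""
    fused = []
    for name, features in modalities.items():
        fused.extend(normalize(features))
    return fused

# Invented toy features standing in for real encoder outputs.
joint = late_fusion(
    text=[0.2, 0.9, 0.4],       # e.g. a clinical-note embedding
    image=[12.0, 3.0],          # e.g. radiology-image features
    sensor=[0.01, 0.07, 0.03],  # e.g. wearable readings
)
print(len(joint))  # 8: the three modalities concatenated
```

The joint vector would then feed a downstream model that reasons across all modalities at once.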

The crucial aspect lies in utilizing the appropriate AI model, or a network of models, tailored to the specific application and timing. Understanding the problems to be addressed and the sequence in which they should be tackled is vital in selecting the right AI model. These models serve as tools to achieve desired outcomes, forming an ecosystem that must be strategically employed based on the application’s requirements. Additional models such as diffusion models and graph neural networks also play significant roles depending on the specific challenges at hand. In healthcare, for instance, multimodal AI applications prove valuable in analyzing medical images and patient data effectively.

Data Is the Fuel That Powers AI Models

When working with life science data, procuring data, generating it, or utilizing existing data from the web all play crucial roles. Data serves as the fundamental fuel that drives the effectiveness of AI models. Before turning to the data itself, however, it is imperative to establish a clear and concise goal for the endeavor:

  • Define the specific areas of learning or study.
  • Identify the core issue at hand.
  • Formulate the hypothesis to be explored.

Understanding these key elements sets a solid foundation for navigating the realm of life science data and leveraging it effectively in AI model operations.

To ensure accurate data collection and effective analysis, it’s crucial to outline all necessary steps:

  • Identify the required data sets for analysis.
  • Define data quality requirements.
  • Define data collection methodologies.
  • Determine the sources of data, whether from public repositories or other channels.
  • Specify the type of data to be obtained via different technologies, for example, biometric data.

Establishing these parameters will guide the selection of the most suitable ML or AI approach/model. The emphasis should always be on a science-first methodology. Rather than preselecting an AI model and seeking data to fit, it’s essential to first define the data needs and then align the approach accordingly. This ensures a robust and data-driven analytical process.
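The planning steps above can be captured as a simple record that gates model selection until every requirement is filled in, which is one way to enforce the science-first ordering. The field names and example values below are illustrative only, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataPlan:
    """Illustrative planning record: the data requirements are defined
    before any ML/AI model is chosen."""
    hypothesis: str
    datasets: list = field(default_factory=list)
    quality_requirements: dict = field(default_factory=dict)
    collection_methods: list = field(default_factory=list)
    sources: list = field(default_factory=list)
    data_types: list = field(default_factory=list)

    def is_complete(self):
        # Science-first gate: no model selection until every field is set.
        return all([self.hypothesis, self.datasets,
                    self.quality_requirements, self.collection_methods,
                    self.sources, self.data_types])

plan = DataPlan(
    hypothesis="Biomarker X predicts response to therapy Y",
    datasets=["trial_cohort_2023"],
    quality_requirements={"missingness": "<5%"},
    collection_methods=["wearable telemetry"],
    sources=["public repository", "in-house assay"],
    data_types=["biometric"],
)
print(plan.is_complete())  # True
```

Only once such a plan is complete would the team ask which model, or network of models, fits the data.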

Truly, it is Data AND Algorithms AND Scale: all three go hand in hand.

Innovation happens in all three of these areas, and that is where we should focus our thinking, efforts, and time.

Drew Dresser, Director, Cloud Engineering, Flagship Pioneering: “The biggest challenges are associated with operationalizing AI/ML at scale – building AI/ML deployment frameworks and ML Ops Playbooks to support different model inference architectures like Kubernetes, SageMaker, AWS Batch, etc. This is a work in progress for us.”

The Bio-IT “Setting Up and Scaling Agile Data & Analytics Ecosystems” Session within the Cloud AI/ML & Modern Data Track

The observed increase in AI and ML solutions in real-life situations over the past few years is exciting but also challenging. In big companies, these solutions must be deployed across hundreds of use cases, which is difficult to do manually. To ensure successful deployment, the adoption of scalable cloud infrastructure and specialized hardware is essential to meet AI’s demanding requirements for extensive data processing and storage. This necessity is particularly evident during AI model training, where substantial data volumes and robust processing capabilities are crucial for generating results.

Certain AI tasks, like deep learning, even demand specialized hardware like GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) for accelerated processing. Moreover, protecting sensitive cloud-based data involves enforcing strict security protocols through established security frameworks and compliance measures to prevent breaches and ensure adherence to data privacy regulations.

A noticeable trend is the shift towards cloud-based data fabric architectures, facilitating rapid technological innovation, enhanced agility, sustainability, and the seamless connection, management, and governance of data across diverse systems and applications. This approach provides a unified and centralized view, enabling smooth data access, sharing, and governance while promoting efficient data management practices.

Moreover, within the corporate realm, the implementation of AI solutions and machine learning models necessitates operationalization. While data scientists and ML engineers possess the necessary tools for model creation and deployment, the crucial phase lies in transitioning these models to production for addressing real-world scenarios. An established framework or methodology is essential to minimize manual intervention and streamline the deployment process of ML models.

The “Cloud for AI/ML & Modern Data Science” session will delve into case studies and best practices, offering insights into the selection of optimal cloud or hybrid infrastructure and respective AI applications. The objective is to drive research and development initiatives, promote collaboration and innovation, and uphold the requisite adaptability to match the evolving technological landscape influencing pharmaceutical R&D.

I am excited to share the stage with the following speakers:

  1. Drew Dresser, Director, Cloud Engineering, Flagship Pioneering – Scaling AI/ML in Biotech: A Survey of Cloud Trends and Innovations
  2. Gregory Hinkle, PhD, Vice President, Research Informatics, Alnylam Pharmaceuticals, Inc. – Cloud Genetics: A Blueprint for Precision Medicines
  3. Yohann Potier, PhD, Senior Director, Data Platform, Tessera Therapeutics, Inc. – Flexible Architecture for Machine Learning for Genomics
  4. Evan Floden, CEO & Co-Founder, Seqera – Powering AI/ML Workloads and Scaling Science with Nextflow
  5. Aaron Jeskey, Senior Cloud Architect, Cloud Engineering, Pinnacle Technology Partners, Inc. – AI/ML on AWS: Building for GxP Validated Environments
  6. Karthik Sekar, PhD, Staff Data Scientist, Invert, Inc. – Towards Foundation Models for Process Development
  7. Paul Brake, Executive Director Life Sciences, Healthcare Life Sciences, Oracle Corp. and Sal Marcuz, Master Principal Enterprise Architect, Oracle Corp. – Data: Your Secret Weapon for Innovation in Life Sciences

Brigitte Ganter
