enlightenbio  Blog

It’s Not the Quantity but the Quality of Data That Enables AI/ML-Based Bioprocess Applications

Over the past 30 years, product titers have increased by almost 20% year over year (Langer and Rader, 2015). Since the 1990s, this has translated into dramatic increases in production, from an average of <1 g/L to >5 g/L in 2020 (Kelley, 2024). Looking ahead, significant improvements in product titers over the next decade will likely be driven by the development of predictive AI/ML models. Years of accumulated historical process data, as well as the improved ability to collect live process data from bioreactors, sensors, and analytical instruments, make this prospect increasingly possible.

Titer prediction models are broadly applicable in bioprocess, helping to identify how Critical Process Parameters (CPPs) affect Critical Quality Attributes (CQAs) and product yields. These models enable tighter process control and real-time optimization or troubleshooting. Digital twins, virtual replicas of entire physical processes, could be deployed for in silico validation or testing prior to manufacturing. However, the impression that building a predictive model requires massive volumes of training data often keeps teams from starting this kind of modeling at all. In reality, more than enough data may already be available for model creation.

“Half a dozen bioprocess runs are usually sufficient to build a predictive bioprocess model, if the right approach is taken.”

The Limitations of Physical Models

Physical models are built on the fundamentals of physics, chemistry, and biology governing a given process. By leveraging existing knowledge, these models establish thermodynamic, stoichiometric, and physical limits on process parameters, which helps optimize process performance more effectively. A typical starting point might be the differential equation dP/dt = pX, which ties product formation (P) to per-cell productivity (p) and cell density (X). Because physical models are derived from first principles, they can perform exceptionally well with relatively little data.
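To make the equation concrete, here is a minimal Python sketch that integrates dP/dt = pX numerically. The exponential growth law for X, the function names, and the parameter values are assumptions made for illustration; they are not taken from a real process.

```python
import math

def simulate_titer(p, x0, mu, t_end, dt=0.01):
    """Integrate dP/dt = p * X with explicit Euler steps.

    Assumption (not from the article): cell density grows exponentially,
    dX/dt = mu * X, which only holds during early, unlimited growth.
    """
    n_steps = int(round(t_end / dt))
    x, titer = x0, 0.0
    for _ in range(n_steps):
        titer += p * x * dt  # product formed during this step
        x += mu * x * dt     # cells grown during this step
    return titer

def analytic_titer(p, x0, mu, t_end):
    """Closed-form solution of the same two equations, for comparison."""
    return p * x0 * (math.exp(mu * t_end) - 1.0) / mu

# Hypothetical values: p in g per 1e9 cells per hour, x0 in 1e9 cells/L,
# mu in 1/h, time in hours. The result is a titer in g/L.
numeric = simulate_titer(p=0.001, x0=0.5, mu=0.03, t_end=100.0)
exact = analytic_titer(p=0.001, x0=0.5, mu=0.03, t_end=100.0)
print(f"numeric: {numeric:.4f} g/L, analytic: {exact:.4f} g/L")
```

Even this two-parameter model yields a full titer trajectory, which illustrates why first-principles models can get by with little data. It also shows the limitation: anything not written into the equations cannot be learned.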

Physical models can also support engineering efforts, such as representing modified pathways through specific equations. However, a significant limitation of physical models is their reliance on the thoroughness and accuracy of process characterization. Because they are based on existing knowledge and assumptions, they are unlikely to uncover hidden relationships between parameters.

Data-Driven ML Models Run into a Performance Ceiling

In contrast to mechanistic models, data-driven models infer the relationship between titer and all available input data during the training process. This approach potentially surfaces the most important CPPs in an unbiased manner, as data-driven models do not rely on prior knowledge and assumptions. Examples of data-driven approaches include Partial Least Squares, Neural Networks, XGBoost, and Gaussian Process Regression. Though they typically require more data than physical models, deploying these models is relatively simple as they are often quick to set up and do not require deep process understanding.

One downside to this approach is that data-driven models often run into a performance ceiling. In our experience at Invert, using purely data-driven methods rarely results in R² values above 0.6, which may not be ideal for tightly controlled pharmaceutical production. The limitations of either data-driven or physical approaches are not easily circumvented on their own—but they might be best overcome by combining both approaches instead.

The Winning Approach Is Hybrid Modeling

Hybrid modeling, which combines physical and data-driven approaches, leverages the strengths of both methodologies: physical intuition, the ability to uncover hidden relationships, strong performance (R² > 0.9), and reduced dependence on large datasets.

While sophisticated methods like Neural ODEs (Bayer et al., 2020) and PINNs (Yang et al., 2024) have been developed to tease apart complex production dynamics in biological systems, straightforward approaches are often sufficient to model specific processes. By incorporating known biological dynamics, a general step-by-step approach can be used to build high-performing models without requiring extensive setup.

Invert’s Step-By-Step Approach to Titer Prediction

  1. Get your data machine learning–ready. It’s crucial to ensure consistent formatting and units across all data, and then organize it by batch. While this can often be a time-consuming step given the numerous data sources and types in bioprocess, Invert makes this process easy and automatic. As a data scientist here, I spend relatively little time on data preparation, which allows me to focus most of my efforts on optimizing the prediction workflow.
  2. Sanity-check the data. This involves determining that the collected data is plausible and accurate: do the values seem reasonable? If you’re working with an abundant product, can you close the carbon balance? Techniques like Principal Component Analysis (PCA) can be very helpful in detecting anomalies or batch effects.
  3. Start with a naïve, time-series hybrid model. Our data science team at Invert favors the approach described by Cruz-Bournazou et al. (2022). The basic steps of our workflow are depicted in Figure 1. This method involves converting state variables into volume-normalized rates and predicting them auto-regressively at each time step. It was specifically designed to address common limitations of bioprocess data, such as limited or sparse measurements and uncertainty across process conditions.

Figure 1: A graphical representation of basic steps within the prediction workflow for the naïve approach.

  4. Evaluate performance on a withheld test set. If the naïve approach underperforms (e.g., R² < 0.7), it may be worth exploring more sophisticated methods like Neural ODEs or PINNs, particularly for intricately staged processes.
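The first two steps of this list (consistent units, organization by batch, and a plausibility check) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Invert's implementation: the record layout, the unit table, and the 20 g/L plausibility ceiling are all hypothetical.

```python
from collections import defaultdict

# Hypothetical conversion factors to g/L; extend for the units in your data.
TO_G_PER_L = {"g/L": 1.0, "mg/L": 1e-3, "mg/mL": 1.0}

def organize_by_batch(records):
    """Convert titers to g/L and group time-sorted samples by batch id."""
    batches = defaultdict(list)
    for rec in records:
        titer = rec["titer"] * TO_G_PER_L[rec["unit"]]
        batches[rec["batch"]].append((rec["time_h"], titer))
    return {batch: sorted(samples) for batch, samples in batches.items()}

def sanity_flags(samples, max_titer=20.0):
    """Return human-readable flags for implausible values in one batch."""
    flags = []
    times = [t for t, _ in samples]
    if len(set(times)) != len(times):
        flags.append("duplicate time points")
    for t, titer in samples:
        if titer < 0:
            flags.append(f"negative titer at {t} h")
        elif titer > max_titer:  # assumed ceiling, adjust per process
            flags.append(f"titer above {max_titer} g/L at {t} h")
    return flags

records = [
    {"batch": "B1", "time_h": 24, "titer": 500.0, "unit": "mg/L"},
    {"batch": "B1", "time_h": 0, "titer": 0.0, "unit": "g/L"},
    {"batch": "B2", "time_h": 0, "titer": -0.1, "unit": "g/L"},
]
for batch, samples in organize_by_batch(records).items():
    print(batch, samples, sanity_flags(samples))
```

Checks like these catch gross errors only. A carbon balance or PCA, as mentioned in step 2, would be the natural next layer.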

The Naïve, Time-Series Hybrid Model

In this section we explain how we replicated this approach with simulated mammalian cell cultivation data from Helleckes et al., 2024. If only the endpoint titer is available, you can assume a starting value of zero and allow the model to learn the progression. For a more detailed breakdown, we documented our process in Invert Notebooks.
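The core idea, converting state variables into volume-normalized rates and rolling the prediction forward step by step, can be sketched as follows. This is a simplified stand-in rather than the published method: Cruz-Bournazou et al. use Gaussian processes for the rate model, whereas this sketch swaps in a one-feature linear regression to stay short. The run layout (lists "t", "V", "X", "P") and all function names are assumptions.

```python
def interval_rates(run, state):
    """Volume-normalized rates of one state variable between samples.

    run is a dict of equal-length lists: "t" (h), "V" (L), and one list
    of concentrations per state variable.
    """
    t, V, c = run["t"], run["V"], run[state]
    return [
        (c[k + 1] * V[k + 1] - c[k] * V[k]) / ((t[k + 1] - t[k]) * V[k])
        for k in range(len(t) - 1)
    ]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_rate_models(runs, states=("X", "P"), driver="X"):
    """Fit one rate model per state, using the driver state as the feature."""
    models = {}
    for state in states:
        xs, ys = [], []
        for run in runs:
            xs.extend(run[driver][:-1])  # feature at the start of each interval
            ys.extend(interval_rates(run, state))
        models[state] = fit_line(xs, ys)
    return models

def rollout(models, t, V, initial, driver="X"):
    """Predict all states auto-regressively from their initial values."""
    traj = {s: [v] for s, v in initial.items()}
    for k in range(len(t) - 1):
        dt = t[k + 1] - t[k]
        x_now = traj[driver][-1]  # predicted, not measured, state
        nxt = {}
        for s, (a, b) in models.items():
            rate = a + b * x_now  # data-driven part: learned rate
            amount = traj[s][-1] * V[k] + rate * V[k] * dt  # mass balance
            nxt[s] = amount / V[k + 1]  # dilution by feed volume
        for s, v in nxt.items():
            traj[s].append(v)
    return traj
```

A typical use would be to call fit_rate_models on the training runs, then call rollout with the planned sampling times, the planned volume profile, and the initial cell density and titer of a new run. The mass balance and dilution terms are the mechanistic part; only the rates are learned from data, which is what keeps the data requirement low.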

Figure 2: Graph benchmarking model performance on withheld test runs by comparing predicted and actual final titers. With R² = 0.72, the naïve hybrid approach outperforms a typical data-driven model.
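For readers who want to reproduce this kind of benchmark, the sketch below shows one way to withhold whole runs and compute R² on final titers. The function names, the seed, and the example values are assumptions for illustration; the numbers are not the data behind Figure 2.

```python
import random

def split_batches(batch_ids, n_test, seed=0):
    """Withhold whole batches so no test run leaks into training."""
    ids = sorted(set(batch_ids))
    random.Random(seed).shuffle(ids)
    return ids[n_test:], ids[:n_test]  # (train ids, test ids)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical final titers (g/L) for withheld runs.
actual_final = [3.1, 4.0, 5.2, 2.7]
predicted_final = [3.4, 3.8, 4.9, 3.0]
print(f"R² on withheld runs: {r_squared(actual_final, predicted_final):.2f}")
```

Splitting by batch rather than by individual sample matters for time-series data: samples from the same run are strongly correlated, so a sample-level split would overstate performance.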

Addressing Remaining Challenges to AI/ML in Bioprocess

Currently, human decision-making drives most process optimization, yet it struggles with the sheer volume of process parameters and their potential interactions. We at Invert believe that leveraging AI/ML methods to fully explore the experimental design space will significantly enhance bioprocess efficiency. However, to facilitate the widespread adoption of AI/ML models, we must address a few important challenges.

Firstly, biological variability presents a challenge. The diverse nature of bioreactor operations, such as membrane bioreactors versus conventional ones, means that a one-size-fits-all modeling approach is difficult to apply effectively to bioprocess, as these variations significantly impact predictions. This variability highlights the importance of maintaining historical data specific to your process. Tailored training data leads to better model performance, even with limited datasets. Unfortunately, most historical data in the biopharmaceutical industry is inconsistently formatted, messy, siloed, and missing crucial experimental context, making it unreliable for training models.

This leads to the second challenge: ensuring that historical data is suitable for AI/ML applications. Invert directly addresses this by ingesting and structuring both historical and current data as part of its typical implementation. We believe that leveraging existing data and automating data cleanup and standardization significantly reduces the time and cost associated with deploying models, especially since new data runs or curation require substantial time and resources.

Lastly, model performance over time is a critical challenge. As processes drift, predicted titers may no longer align with actual results. Incorporating advances in real-time data collection from process analytical technologies or low-latency bioreactor sensors could facilitate the development of models that evolve with the processes, which would be a major step towards fully automating titer predictions.
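One simple way to notice such drift is to track the relative prediction error over the most recent batches and flag when its average exceeds a tolerance. The sketch below is one possible approach under assumed settings, not a description of Invert's product; the class name, window size, and tolerance are hypothetical.

```python
from collections import deque

class DriftMonitor:
    """Flag drift when the mean relative error of recent batches is too high."""

    def __init__(self, window=5, tolerance=0.15):
        self.errors = deque(maxlen=window)  # keeps only the latest errors
        self.tolerance = tolerance

    def update(self, predicted, actual):
        """Record one finished batch; return True if retraining looks warranted."""
        self.errors.append(abs(predicted - actual) / abs(actual))
        window_full = len(self.errors) == self.errors.maxlen
        mean_error = sum(self.errors) / len(self.errors)
        return window_full and mean_error > self.tolerance

monitor = DriftMonitor(window=3, tolerance=0.1)
for predicted, actual in [(5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 4.0), (5.0, 4.0)]:
    print(predicted, actual, monitor.update(predicted, actual))
```

Waiting for a full window avoids reacting to a single outlier batch, at the cost of a slower response to genuine process changes.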

“Data forms the foundation of predictive models for titers and other key bioprocess metrics. However, the quantity of data is not the most crucial aspect. Instead, having process-specific data that is complete, contextualized, and consistently structured enables far more applications and is therefore much more important.”

References

Bayer B, von Stosch M, Striedner G, Duerkop M. Comparison of Modeling Methods for DoE-Based Holistic Upstream Process Characterization. Biotechnol J. 2020 May;15(5):e1900551. doi: 10.1002/biot.201900551. Epub 2020 Feb 17. PMID: 32022416.

Cruz-Bournazou MN, Narayanan H, Fagnani A, Butté A. Hybrid Gaussian process models for continuous time series in bolus fed-batch cultures. IFAC-PapersOnLine. 2022 Aug;55(7):204–9. doi: 10.1016/j.ifacol.2022.07.445.

Helleckes LM, Hemmerich J, Wiechert W, von Lieres E, Grünberger A. Machine learning in bioprocess development: from promise to practice. Trends Biotechnol. 2023 Jun;41(6):817-835. doi: 10.1016/j.tibtech.2022.10.010. Epub 2022 Nov 28. PMID: 36456404.

Helleckes LM, Wirnsperger C, Polak J, Guillén-Gosálbez G, Butté A, von Stosch M. Novel calibration design improves knowledge transfer across products for the characterization of pharmaceutical bioprocesses. Biotechnol J. 2024 Jul;19(7):e2400080. doi:10.1002/biot.202400080. PMID: 38997212.

Kelley B. The history and potential future of monoclonal antibody therapeutics development and manufacturing in four eras. MAbs. 2024 Jan-Dec;16(1):2373330. doi:10.1080/19420862.2024.2373330. Epub 2024 Jul 1. PMID: 38946434.

Langer E, Rader R. Biopharmaceutical Manufacturing: Historical and Future Trends in Titers, Yields, and Efficiency in Commercial-Scale Bioprocessing. BioProcessing Journal. 2015 Jan 16;13(4):47–54. doi:10.12665/J134.Langer.

Yang S, Fahey W, Truccollo B, Browning J, Kamyar R, Cao H. Hybrid modeling of fed-batch cell culture using physics-informed neural network. Industrial & Engineering Chemistry Research. 2024 Sept 19;63(39):16833–46. doi:10.1021/acs.iecr.4c01459

Karthik Sekar – enlightenbio Guest Blogger

Karthik Sekar, PhD, is a Staff Data Scientist at Invert, a software company focused on accelerating biomanufacturing through data management, analytics, and control systems. He graduated from Northwestern University with a doctorate in Chemical Engineering with an emphasis on metabolic engineering and synthetic biology.

After graduate school, Karthik commenced post-doctoral studies in system biology at ETH Zurich. He then moved to the Bay Area, where he worked for a few years at Emerald Cloud Lab as a software engineer and Climax Foods as the lead data scientist. At Invert, Karthik finds AI/ML-driven solutions for the industrial bioprocess industry.
