enlightenbio Blog

Why Are We Still Talking About Data Infrastructure for Life Sciences in 2025?


“In my twenty-five years in the Life Sciences industry, I’ve observed countless organizations adopt, revise, dismantle, and rebuild their data infrastructure.”

While the rapid advancement of data technology certainly explains some of this churn, it also makes me wonder why the industry’s very foundation – empirical data – still lacks a stable, performant, flexible system that can endure across different technological eras.

I think the reasons for this continuing problem are manifold:

  • Heterogeneity and Emerging Data Types: Current data technologies often work best with bespoke solutions, like mature DNA / RNA sequencing pipelines. However, these specialized systems, such as those used in clinical trials and Electronic Data Capture (EDC), struggle to adapt to new data types like biomarkers. This often necessitates the development of new, aggregative systems downstream.
  • Organizational Instability and Leadership: New executives and managers often arrive with a mandate to transform or upgrade existing tooling with new technology – that’s partly why they were hired. See also: Data Architecture Consultants. While this can be beneficial, turnover at the developer level can lead to maintenance problems, especially when the original system engineers leave without sufficient documentation.
  • Cloud Adoption and Legacy On-Prem Systems: While most organizations are comfortably operating in the cloud, a collection of legacy on-premise data systems still awaits migration. Furthermore, some systems are moved to the cloud without being adapted to leverage its benefits, essentially being “lifted and shifted” into containers without optimization.
  • Federated Analysis Challenges: While large biomarker and clinical data sets are increasingly available, particularly for DNA / exome variants, they are often only accessible and useful when aggregated into a single source. Thousands of institutions and disease communities host rich datasets that patients would willingly share for biomedical discovery. However, regulatory, privacy, and technical limitations currently require individualized licensing and access for each institutional data source.
  • Data Systems Designed for Storage, Not Usage: This, I believe, is the biggest issue contributing to ongoing data problems, with some ramifications:
    • Complex Schema and Local Transformation: When data storage technologies present data in a complex schema, extensive transformation and modeling (e.g., table joins, integration with other sources, cleaning, harmonization with ontologies, conversion to data frames) are required to make the data useful. Many organizations do this important work at the research / business unit level, and the processed data remains siloed within that unit rather than being returned to the centralized data store. Quite often, individual coders perform this work ad hoc for specific purposes, leading to a loss of development history.
    • Data Warehouse Proliferation: Some organizations have adopted data warehouses as the gold standard for analysis. While highly useful, the constant emergence of novel use cases for data often requires a separate, bespoke table or view. This can lead to a complicated proliferation of the data warehouse, leading to catalog, governance, and maintenance issues.
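The “local transformation” work described above can be sketched in a few lines of Python. Every record here, and the ontology mapping, is a made-up illustration of the cleaning, harmonization, and joining that typically happens at the business-unit level:

```python
# A sketch of ad-hoc, unit-level transformation work.
# All records and the ontology mapping below are hypothetical illustrations.

# Raw records from two sources, using inconsistent vocabulary
samples = [
    {"sample_id": "S1", "diagnosis": "NSCLC"},
    {"sample_id": "S2", "diagnosis": "non-small cell lung cancer"},
]
variants = [
    {"sample_id": "S1", "gene": "EGFR", "variant": "L858R"},
    {"sample_id": "S2", "gene": "KRAS", "variant": "G12C"},
]

# Harmonize free-text diagnoses to a standard ontology term
ONTOLOGY = {
    "NSCLC": "MONDO:0005233",
    "non-small cell lung cancer": "MONDO:0005233",
}

def harmonize_and_join(samples, variants):
    """Clean, harmonize, and join two sources into one analysis-ready table."""
    by_id = {s["sample_id"]: s for s in samples}
    rows = []
    for v in variants:
        s = by_id[v["sample_id"]]
        rows.append({
            "sample_id": s["sample_id"],
            "diagnosis_id": ONTOLOGY[s["diagnosis"]],
            "gene": v["gene"],
            "variant": v["variant"],
        })
    return rows

table = harmonize_and_join(samples, variants)
```

When logic like this lives only in one analyst’s script or notebook, the harmonized table never returns to the central store – which is exactly the siloing problem described above.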

“Data systems designed for storage, not usage, remain the single biggest obstacle to progress.”

From FAIR to Data Mesh and Data as a Product

In the Life Sciences industry, we have been implementing FAIR principles for our data for many years. These principles are easily applied at the granular level, where each empirical or ontological data source is made available and served as a FAIR unit. However, difficulties arise with higher-level integrations of multiple FAIR (and non-FAIR) data sources. This raises the questions of whether the provenance of each FAIR data source carries all the way through into the aggregated version, and whether aggregated tables in a Data Warehouse are themselves FAIR.

Meanwhile, outside of the Life Sciences industry a new data architecture paradigm, Data Mesh, has emerged. The Data Mesh concept shares many similarities with FAIR but also considers both the organization and the individuals involved in collecting, processing, storing, and using data, as well as the challenges of combined, integrated data products. Most significantly – at least to me – Data Mesh introduces the concept of Data as a Product, which aligns closely with the product-like requirements of FAIR principles for data sources.

However, Data as a Product extends beyond FAIR because it considers the practical use cases of that data and the personas of people using it. Crudely put, you can make poor quality data in an overcomplicated, bespoke schema completely FAIR, but it would still fail the usability requirements of a Data Product.

When I first read about the Data Mesh concept five years ago, I was so excited about its potential that I nearly jumped out of my chair and shouted “Yes, finally!” However, I’m acutely aware that good ideas take time to be adopted, especially within large organizations with extensive data and legacy systems. It will take more time before these pragmatic concepts take hold, particularly with the rapid emergence and evolution of AI, LLMs, and agents.

Moving Towards an Application Layer for Data & AI

Let’s talk about ideal data-serving systems, referred to as Data Products, and their implications for AI:

  • Clean and Harmonized: Data must be clean and consistent, utilizing standard vocabularies and ontologies.
  • Attributable and Versioned: Data must be traceable to its source(s) and versioned to track changes over time.
  • Well-Modeled: Data must be well-modeled for specific known use cases, often presented as a single table or data frame.
  • Self-Documenting and Accessible: Data must be self-documenting and available through standardized methods and APIs.
  • Embedded Algorithms: Beyond simple querying, data sources should offer embedded, versioned algorithms that can be executed individually or as part of a federated analysis.

Imagine a data source that is as easy to connect to as a browser connecting to an HTTP web server (see Figure 1). Such a system would provide methods to serve up all pertinent metadata, guide users on how to slice and query to retrieve data frames, and even give additional methods for running embedded algorithms, returning results in a consumable [visual or computational] format. You could seamlessly integrate this data into routines and pipelines using R or Python scripts, or instantly connect dashboard and analytics platforms, all while preserving proper access and authorization controls. Furthermore, should you need to combine this data with another source, the process would be equally simple and efficient. There would be no need to understand complex table schemas or file formats; data would be easily discoverable in a catalog, allowing for instant permission or license requests from the owner. This would enable instant and attributable federated analysis across all applicable data sources. This is what we mean by Data Product.

“Imagine a data source that is as easy to connect to as a browser connecting to an HTTP web server – clean, versioned, self-documenting, and instantly accessible.”

Figure 1: A diagram showing how Data Products built from disparate sources can be accessed via harmonized API protocols. Source: Tag.bio.


AI and Model Context Protocol (MCP)

Finally, consider how an ideal Data Product or Data Mesh should function in the context of AI tools and agents seeking to leverage this data to answer questions. The findability, access, and interrogation of the data source [and its metadata] must be entirely computational. If these Data Products are federated, especially across disparate sources like clinical trial data, biomarkers, EHR, or knowledge bases, they should still be accessible via a standardized API layer. This ensures that computational AI tools and agents can interact with them via a single communication protocol, avoiding ad-hoc combinations of SQL and flat files in buckets.

Very recently, the Model Context Protocol [MCP] has taken over the rapidly expanding AI technology space. I believe it represents the missing link between generically trained LLMs and their agents on one side, and disparate / esoteric / complex sources of data on the other.

At Tag.bio we have layered an MCP Server (see Figure 2) as an access point into each of our Data Mesh installations. This makes any Data Product, or combination of Data Products, accessible to an MCP client – i.e., LLMs and agents – in exactly the same way that any web server is accessible to any web browser on the World Wide Web.
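To make the analogy concrete, here is roughly what an MCP exchange looks like on the wire. MCP is built on JSON-RPC 2.0: a client first lists a server’s tools, then calls one. The tool name and arguments below are hypothetical stand-ins for whatever a Data Mesh’s MCP server actually exposes:

```python
import json

# MCP messages are JSON-RPC 2.0. A client first discovers the server's tools...
list_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# ...then invokes one. "query_data_product" and its arguments are hypothetical
# stand-ins for a Data Product's actual methods.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "query_data_product",
        "arguments": {"product": "tumor-biomarkers", "gene": "EGFR"},
    },
}

wire_payload = json.dumps(call_request)
```

Because every MCP server speaks this same protocol, an LLM agent needs exactly one client implementation to reach any number of Data Products – which is what makes the browser-and-web-server analogy hold.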

Figure 2: A diagram showing how the Model Context Protocol provides a unified, translating gateway for LLM and AI clients to connect to disparate Data Products within a Data Mesh. Source: Tag.bio.

Jesse Paquette – enlightenbio guest blogger

Jesse Paquette is a computational biologist and data architect specializing in data modeling and visualization. His professional aim is to facilitate discovery by delivering data into the minds of expert biologists. He is currently Chief Science Officer at Tag.bio.


