Truwl is on a Mission to Make Bioinformatics Comprehensible and Accessible

This latest interview is with Karl Sebby, the CEO of Truwl. Truwl is an exciting, relatively new bioinformatics company focused on accelerating and simplifying biological research. Truwl is also the name of the platform the company is developing. To date, Truwl has raised $1.7M in seed funding and currently has five employees working to address the challenges associated with bioinformatics data analysis.

The following summarizes questions and answers from my dialogue with Karl Sebby.

Enlightenbio: Tell us a little bit about Truwl and the genesis of its name. How did you come to build Truwl?

Karl Sebby: We wanted a name that was agnostic since you may not always take the path initially envisioned. It all started with the two founders living on different roads that were both named after creeks. I was on Trumble Creek Road and he was on Owl Creek, hence the result was a mash-up of Trumble and Owl. We also liked that, with the W-L in it, the name paid homage to the workflow languages we build on. Lastly, we liked “True” in it, the ability to find truth in biology.

We started the company out of my initial frustration with trying to reuse published methods. My background was in physical and analytical chemistry and teaching. As I got started with a cancer research project, I found I could read the literature, but reimplementing published analyses and building on them was frustrating. I felt unable to pick up where others had left off.

It seemed like such a waste that people were starting from scratch over and over again. In computational research, one should have straightforward access to analysis methods. Analysis methods should easily be shareable, independent of an instrument or an assay type. Hence, the initial idea was to make analyses shareable in a better way so they can be reused by others. And not just the methods, not just a container, but a complete analysis that someone could take and then start playing with without setting anything up. Researchers can easily collect data, but have trouble analyzing it, which is a well-known bottleneck.

Another issue that spans all of science is the reproducibility crisis. It’s challenging to reproduce experiments that others have performed and published. Hence, the needs we’re addressing are very well established. By providing easy access to bioinformatics capabilities we can address both the bioinformatics bottleneck and the reproducibility issue. Although we can do method development, Truwl’s main expertise is taking existing methods and putting them in a form that is discoverable and ready-to-use.

“The platform is primarily a methods distribution platform to support researchers’ analysis needs.”

The distribution of methods can be of any scope, publicly (external) or limited within an organization (internal).

EB: You mentioned Truwl’s analysis expertise. Do you provide analysis services, so somebody that works in pharma or the clinical sector can come with hundreds of samples and you help them with their data analysis?

KS: Yes, we do. We haven’t advertised this aspect much, but we have worked on several academic collaborations and early stage biotech projects. It is not our main focus or revenue source, though it has two main advantages:

It provides us a way to understand what our customers’ needs are.
Non-proprietary methods we use on custom projects can be made available via the Truwl platform to a broader audience which allows us to grow our content.

EB: What is the ultimate main driver for Truwl? Sharing methods and entire analysis pipelines, and creating an environment to run analyses after users have uploaded their data?

KS: Yes! We’re completely cloud-based, so the workflows on the system are runnable directly on the cloud from the platform. The data has to be on the cloud somewhere, with the user specifying a file URL as an input and Truwl putting the outputs in a cloud bucket. We can manage data for customers, but we also allow them to keep it on their own accounts and just provide access to our system. Depending on the subscription level, we either provide individual download links for the outputs, or we can provide direct bucket access.

All of this can be managed via a web-based User Interface (UI), which allows users to select the data and parameters and execute workflows. We offer a workflow input editor similar to other platforms, with the one difference that we want to make it more accessible. So, someone could come in, run a job without ever talking to a sales rep or booking a demo, and take advantage of a pay-as-you-go plan. If a researcher wants to use our platform one time for just a few samples, we’re happy to support that.

To make it all more accessible, we provide pipelines with a lot of valuable information – pipelines come with:

- A description.
- Transparency with estimated job costs:
  - Examples of runs with time and cost details.
- Tutorial videos that show someone how to run the pipeline.

An early partner of ours is the ENCODE Data Coordination Center, and as such we provide all of the major ENCODE pipelines. Since the development of these methods was an important aspect of the project, they should be made available to the widest audience possible. To date, we have also made pipelines from other well-known projects available, including GTEx, GATK, and BioWDL.

Figure 1: ENCODE ATAC-seq Pipeline (screenshot Truwl platform).

EB: You mentioned that researchers are challenged with undertaking a bioinformatics analysis, not knowing where exactly to start. Do you believe your platform is appropriate for non-bioinformaticians, or do you think you still need to have at least a little bit of knowledge of how to handle it?

KS: A little knowledge is always helpful, but our goal is to make it as easy an experience as possible, especially for those without prior experience. That is where the community and publishing complete examples come into play. We enable users to share complete examples that others can use as a starting point for their own analysis.

Existing pipelines with pre-defined parameters: It’s often confusing to know what parameters to use for a specific analysis, and what the inputs should look like, etc. For this purpose we implemented a fork button that allows someone to look at a shared example and make a complete copy to start their own experiment. One can hit fork, hit run, and run the exact analysis previously published. This way someone can hit the ground running right away and start building their own analyses off someone else’s.

We’re also providing an input editor, which allows someone to rerun an existing experiment with public or their own data and basically repeat someone else’s experiment.

EB: Will the researcher know upfront the estimated cost of running an analysis, or will there be a rude awakening after completion?

KS: The pipeline description pages show the cost of example jobs. Those are pretty good estimates and help users see the per-sample costs before scaling up, so there are no surprises.

The cost of a job is calculated and shared with the user 24 hours after job completion.
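
As an editorial illustration of how a reader might use the example-job costs mentioned above, the sketch below scales an example job's cost linearly to a planned batch. It is a minimal sketch, not Truwl's pricing logic: the function name `estimate_batch_cost`, its parameters, and the assumption that cost grows linearly with sample count are all hypothetical.

```python
def estimate_batch_cost(example_cost_usd: float,
                        example_samples: int,
                        planned_samples: int) -> float:
    """Scale an example job's cost to a planned batch size.

    Assumption (not from the interview): cost grows linearly with the
    number of samples, so per-sample cost = example cost / example samples.
    """
    if example_samples <= 0:
        raise ValueError("example_samples must be positive")
    if planned_samples < 0:
        raise ValueError("planned_samples must not be negative")
    per_sample_cost = example_cost_usd / example_samples
    # Round to cents for display purposes.
    return round(per_sample_cost * planned_samples, 2)
```

For instance, if an example job processed 2 samples for $6.00, `estimate_batch_cost(6.0, 2, 50)` gives a rough figure of 150.0 for a 50-sample batch. Real costs depend on input sizes and parameters, so an estimate like this is only a starting point.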

EB: Who are you targeting with this community platform, and who benefits the most from this type of solution? Are you targeting the clinical sector, or are you targeting individual end users/researchers in academia and pharma?

KS: Our early users are mostly in academia/research, with some commercial users. As the platform matures our goal is to expand into the biotech and clinical sector. On the clinical side, we’ll be targeting genomic testing companies, both established companies and those that want to bring genomic testing closer to regional healthcare systems.

We did add a variant benchmarking workflow (see Figure 2), and our vision with this specific workflow is workflow optimization, validation, and re-validation for clinical testing. When you run a standard, you want to make sure that you’re detecting the variants that you say you can detect. With our variant benchmarking pipeline we provide a performance metrics table across many jobs, which is a unique feature of our platform.
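
As an editorial illustration of the kind of performance metrics a variant benchmarking run typically reports, the sketch below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives against a truth set. These formulas are standard; the function name `benchmark_metrics` is hypothetical, and nothing here is taken from Truwl's actual workflow.

```python
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute standard benchmarking metrics from variant call counts.

    tp: calls that match the truth set (true positives)
    fp: calls not present in the truth set (false positives)
    fn: truth-set variants that were missed (false negatives)
    """
    # Precision: fraction of reported calls that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (sensitivity): fraction of true variants that were detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

With 90 true positives, 10 false positives, and 30 false negatives, precision is 0.9 and recall is 0.75. Tracking these numbers per job is what makes it possible to tell whether a pipeline still detects the variants it claims to detect.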

We’re at early pilot stages with clinical testing companies. Some of the challenges they encounter are associated with revalidation, running standards, and bringing on a new instrument or a new lab.

Figure 2: Performance Metrics Variant Benchmarking Workflow (screenshot Truwl platform).

EB: How are you reaching your target audience?

KS: A lot like GitHub, in the sense that you can reach a lot of content without being logged into the platform or even having to create an account first. Our site is pretty open. When performing a Google search, people can find our site and see public examples. There is also word-of-mouth, some direct contacting, and reaching out to individuals in a specific research area. Some of the workflows have actually attracted quite a few users, for example, the ENCODE workflows (e.g., atac-seq-pipeline [see Figure 1]). There are instructions for how to run those workflows on Truwl from the GitHub repo.

“One of our primary strategies is working with early partners that know our end users’ needs in depth. This allows us to integrate components that our users need to make our platform a more complete solution and reach our target audience.”

EB: How do you decide which workflows to create and make runnable? Is this based on community requests or voting?

KS: Our initial development efforts focused on Workflow Description Language (WDL) workflows. Hence, the workflows that are currently productionized are well-vetted WDL workflows. Past that, it comes down to what our users ask for, or what we see as a need that we have to fulfill.

Something that does differentiate us from a lot of other places is our aim to be workflow-agnostic. While we started out with WDL, we’ve also been playing with Nextflow. In fact, we are close to releasing the first Nextflow pipelines. The content in Nextflow Core is very attractive, with more than 50 total workflows and 30 of those being production ready. The Nextflow community has done a fantastic job of standardizing productionized workflows. While researchers care about the design of the workflow language, I believe it comes down to the content. First and foremost, people are looking for analysis methods that are robust, well-annotated, and validated by the community.

Our goal is to eventually support the four major workflow languages in bioinformatics: WDL, Nextflow, CWL, and Snakemake, so users can access and use methods in a single place in a uniform way, without being concerned for how the underlying code is written.

“One thing that does differentiate us from a lot of other places is our aim to be workflow-agnostic.”

EB: Is Truwl looking into working directly with a community like Nextflow’s?

KS: We’re not that involved with workflow development unless we encounter bugs we need to fix, but we’re the ones that help complete that last mile of accessibility. Right now, I think via Nextflow Tower (Seqera Labs) one can launch scripts right from the Nextflow Core website, but there are still steps that require integration with different systems. We’d want to support them natively from Truwl, and be a general workflow runner, where you can find any content and see examples, and try to find the thing that’s right for you.

In bioinformatics, the problem that’s been talked about over and over again is the need for a user interface (UI) for biologists that are not comfortable with using the command line. While true, that’s really only one part of it, as you still need to know what’s actually available, how to use it, when is it appropriate to use it, and then once you’ve used it, how do you know you did it somewhat correctly?

Picking the right workflow is so important, because you tend to stick with it for a long time, especially in the clinic. One has to lock into the version, the results/output, everything!

EB: Why should a genomics researcher come to your website and use your tools? What are the advantages you offer over other platforms that provide genomic data analysis tools and pipelines? Is it analysis speed, cost, or something else?

KS: We remove as much friction as possible and make our platform and its content easily accessible. Installing software and getting things to work is a well-known pain point and can take a lot of time and effort. We have workflows with a web-based input editor and the underlying infrastructure set up on the cloud and ready-to-go so users can get started. We provide complete examples of analyses/jobs. We are offering a pay-as-you-go plan that doesn’t require a subscription, and we are providing pricing estimates on workflow pages so there are no bad surprises.

EB: Who do you view as your current competition and why? What differentiates Truwl from other players in the market?

KS: Bioinformatics is such a scattered space, which makes it hard to know what tools and platforms to use and why. There are well-known commercial platforms such as DNAnexus and Seven Bridges; there are academic-focused platforms like Galaxy and Terra (an academic-commercial partnership); and then there are a bunch of area-specific tools and platforms for topics like the microbiome, or companies built around workflow languages, like Seqera with Nextflow. There are many more out there and several endeavors have come and gone, with new bioinformatics-focused platforms popping up all the time.

A common theme that is talked about a lot and is documented in the literature is centered around making bioinformatics more approachable by enabling users to do analyses without using the command line. This is important, but is only one of the barriers that exist for making bioinformatics capabilities more accessible. There are many questions that require an answer before we can actually execute a method, such as:

What methods are available?
What is the right method for my specific project?
What parameters should I use to run a method?
Once the analysis is complete: Did the workflow run as expected and are the results as expected?

Biology is all about making comparisons, and computational biology is no different. A main focus of ours is enabling comparisons. This includes comparing results to truth sets in benchmarking experiments and comparing metrics across compute jobs.

We have built out our system to easily enable comparisons such as the aforementioned, by automatically pulling metrics out of files and feeding them into a comparison table directly on the platform.
We have applied this to benchmarking workflows, but the system is general and can be applied to any workflow to track any metrics of interest. This has applications for developing, optimizing, validating, and revalidating workflows in genomic testing but has a lot of other use cases as well.
Making these types of comparisons is really important, but to this day nobody else has made it a primary focus.
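
As an editorial illustration of the idea described above (pulling metrics out of job output files and placing them side by side), the sketch below builds a simple comparison table from per-job metrics. It is a minimal sketch under stated assumptions, not Truwl's implementation: it assumes each job's metrics are available as a JSON string, and the function name `build_comparison_table`, the job identifiers, and the metric names are all hypothetical.

```python
import json
from typing import Dict, List, Optional


def build_comparison_table(job_metrics_json: Dict[str, str],
                           metric_names: List[str]) -> List[Dict[str, Optional[float]]]:
    """Collect selected metrics from several jobs into one table.

    job_metrics_json maps a job identifier to the JSON text of that
    job's metrics file (assumed format: a flat JSON object).
    Returns one row (dict) per job, in the order the jobs were given.
    """
    rows = []
    for job_id, raw_json in job_metrics_json.items():
        metrics = json.loads(raw_json)
        row = {"job": job_id}
        for name in metric_names:
            # Use None when a job did not report a given metric,
            # so gaps stay visible in the comparison.
            row[name] = metrics.get(name)
        rows.append(row)
    return rows
```

Calling it with two jobs, one of which lacks a `recall` value, yields two rows with the missing entry set to `None`. Keeping one row per job and one column per metric is what allows the same table to serve benchmarking, optimization, and revalidation alike.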

We’re also not limiting ourselves to any specific workflow language or technology. We believe providing a common interface to all these methods would be a game changer and enables us to host a huge variety of ready-to-use methods, and enables users to compare, evaluate, and execute methods from a single platform.

“Two of the most important aspects of a successful bioinformatics platform are accessibility and content, and we have developed the Truwl platform with these considerations in mind.”

EB: What do you see as the biggest challenge(s) the genomics data analysis field is currently facing and why? How can we overcome these challenges? How important is standardization of data analysis?

KS: There are so many challenges – let me highlight a few of them:

Standardization: there are so many areas in genomics that would benefit from the development and uptake of standards: more standardized samples, standards around data security and access, and of course standards for data analysis methods. There are many situations where data analysis MUST be standardized. This includes collaborative projects, like work performed in consortia, where results are coming from multiple sources with the intent of combining and comparing the data. This simply can’t be done with confidence unless everyone is processing their data the same way. Lack of standardization also makes it hard for newcomers to know what to use. Everyone is doing things their own way, and as a result a lot of time is spent on repeated efforts. Keeping up with the technology is so hard. First one needs to know what is current, what is robust, and what is going to provide good results, and then how to use it. Nextflow core has been phenomenal in this area, first by providing a set of curated pipelines, and then by setting standards on how the pipelines should be written.
Data access: another well-known and ongoing pain point. There are a lot of parts to this. Determining who can have access is only the starting point. Providing and tracking access and ensuring that the data doesn’t leak outside of set boundaries is a big concern for data owners. Not being able to access data is the primary reason that workflows don’t run successfully on Truwl on the first try. Even if our system is supposed to have access, there often is a permission setting that wasn’t set properly.
Bioinformatics methods: these are a necessary part of the genomics ecosystem, but they require maintenance and continued development to become (and stay) high-quality, robust, and well-documented, and to be used and trusted widely. Unfortunately, the reward system for maintainers and developers isn’t there, so projects often become abandoned, or individual researchers have to optimize them themselves, often on the side. Maintenance and support of software needs to be recognized as an important scientific output, and there needs to be funding for it. I was excited to see the Chan Zuckerberg Initiative (CZI) providing some grants in this area, but that is just a start; there needs to be much more.
Methods testing and validation: Testing and validation of workflows doesn’t get as much attention as needed. When a workflow fails it’s hard to understand where things went wrong. Similarly, when a workflow succeeds you still need to verify that everything is correct. A generalizable framework for testing workflows and a community-maintained collection of testing datasets would be beneficial to everybody.

“Maintenance and support of software needs to be recognized as an important scientific output, and there needs to be funding for it.”

EB: Is there anything else you would like to share with the readership?

KS: We are beginning pilot programs with partners for our workflow comparison system, so we are interested in connecting with organizations that have needs around workflow validation, re-validation, and ongoing assay performance monitoring.
