enlightenbio  Blog

Sentieon Solves Complex Mathematical Problems in DNA Sequence Secondary Data Analysis with Extreme Accuracy and Efficiency

This month’s “Company Spotlight” provides a closer look at Sentieon, a developer and supplier of a suite of bioinformatics secondary analysis tools that process genomics data with high computing efficiency, fast turnaround time, exceptional accuracy, and 100% consistency.

Sentieon builds scalable and easily deployable tools that are drop-in replacements/improvements for BWA/GATK/MuTect/MuTect2. The thirteen-person company is headquartered in Mountain View, CA, and was founded in 2014 by Jun Ye and his team with expertise in algorithm, software, and system optimization.

The following summarizes questions and answers from my dialogue with Brendan Gallagher, Business Development Director at Sentieon.

EB: Tell us more about Sentieon – Your business is around delivering faster, more accurate genomic analysis software. What need(s) are you trying to address and what products/services do you offer?

Brendan Gallagher: We offer software tools for DNA sequencing secondary analysis that produce accurate and 100% consistent results without downsampling. These tools are flexible to our customers’ assay and project needs, support scalability, and are cost-effective. Our tools enable our customers to perform high-quality genomic analysis in their production environments while using either industry-standard mathematical models like BWA/GATK, or our improved-accuracy models. For example, we produce the exact same BAM as BWA-MEM but about 2x faster and more efficiently, and the overall pipeline is up to 10x faster than standard BWA/GATK software.

EB: Who are you targeting with the software product(s) you offer? In other words, who is benefiting from using your analysis software solutions? Are you also targeting the clinical sector? Can you provide some examples of end-users and organizations that are currently using your software?

Brendan: Any BWA/GATK user is a potential Sentieon tool user since these users are all choosing the same great mathematical foundations developed by the Broad Institute. Therefore, these users can immediately enjoy the benefits of speed, consistency, and robustness if they use Sentieon tools.

Our tools are applied by biopharma companies big and small, hospital and clinical systems, Direct-to-Consumer (DTC) genetic testing companies, genome centers, academic and research institutes, and molecular diagnostic companies. Hence, broadly speaking, anybody who currently works with DNA sequence data can benefit from using our tools. Furthermore, if you are working with genomic data at the FASTQ, BAM, CRAM, or gVCF level, we have a solution that can help.

EB: How big (number of employees) is Sentieon and how many engineers are involved in developing the optimized software solutions?

Brendan: We have 13 team members, with about five focused on developing algorithms and software and the rest developing applications, providing support for our customers, or doing business development. We also have one engineer who is dedicated full time to Software Quality Assurance. Sentieon’s core engineers have been trained in physics, electrical engineering, computer science, and mathematics, and bring strong expertise in algorithm development, signal processing, HPC systems, and related areas. Their focus is on enterprise software engineering, while the support team consists primarily of bioinformaticians who support and collaborate with our customers.

EB: How do you select what specific software to optimize for faster, more accurate outcomes? In other words, do you predominantly take existing popular, best-practices software tools and optimize those, or do you also optimize other, less popular tools, and if so, which ones?

Brendan: Sentieon’s skill is to accurately and efficiently solve complex mathematical problems in computer science. As the team investigated how to add value to the genomics world a few years ago, they identified BWA-MEM, GATK, and MuTect as highly accurate tools for genomic data processing. We have since added Joint Calling and MuTect2 to allow our customers to use the same math with better software. Sentieon has also released new tools that improve upon these solid foundations, e.g., by adding machine-learning modules for better accuracy. These new tools do produce different results that are, based on the NIST truth sets, more accurate. You can think of the improvements as fixing some edge cases of the GATK mathematics. We typically license all of our tools to everyone so they can choose among the options. Generally, customers either want to replicate an exact GATK-based pipeline, or, if they want better accuracy as determined by public truth sets or their own internal truth sets, they choose our DNAscope and TNscope pipelines. It is up to our customers whichever tool they prefer.

EB: What defines the truth of an individual sample/analysis and with that the most accurate result and how does this affect your optimization process?

Brendan: Genomics is an interesting and challenging industry since defining the truth on an individual sample is elusive. NIST is undertaking a heroic effort to define some truth sets; however, these exist for only a handful of samples. When evaluating any one sample, it’s about choosing the best and most robust statistical methods and trusting in those methods to produce the most accurate results. As a result, we looked for the most accurate and robust math and statistics with which we could improve the computational algorithms and software implementation while keeping the results the same. This way people can utilize industry-standard math, with improved performance, usability, and support.

As the truth sets and sequencing data get better, especially with the advent of longer read technology, we can explore improving our tools and setting better standards for accuracy. We are currently working on Machine Learning filtering for improved accuracy with our DNAscope and TNscope tools, de novo assembly, and other potential accuracy improving methods for specific applications like tumor only variant calling.

EB: How many of your products are “off the shelf” versus “customer-based or customer-optimized”? Do you also work with customers and build exclusive software tools?

Brendan: All of our tools are pretty much off the shelf, although we welcome our customers’ feedback and work closely with many of them. Our customers still use our individual software components to build their own “custom” pipelines. The GATK replacement products use the same fundamental math, so they produce the same result as GATK. From an engineering and product standpoint, one of the biggest advantages of our non-Broad-matching tools is that we can improve and adapt them based on feedback from our users to better suit their specific applications. We are also open to working with customers to develop functionalities as required by them, if and when such needs arise.

EB: There often is a trade-off between analysis speed and cost. Are your tools costlier because they are faster, or not?

Brendan: Our tools actually save our customers money when you count all costs, including compute. We have improved the core computing algorithms so that they run faster and hence cost less in compute time. Our drop-in replacement of BWA-GATK, for example, is a positive-ROI investment for our customers; the saved computing cost is more than our software’s licensing fee. In other words, we can say “we are cheaper than free.”

EB:  How can users run your various software tools? Can they be deployed anywhere such as in the cloud, behind the firewall, or even locally on a computer, or is there a specific system requirement?

Brendan: Our software tools can be deployed anywhere people have their data. We have customers running our tools on AWS, Google, Microsoft, private clouds, local clusters, individual servers, and even desktops. There are basically no special hardware requirements; our tools can run on any CPU-based system. Users can easily install our tools on any system, wherever they currently process their data.

EB: What are your tools’ performance and costs like?

Brendan: We have tools for various variant calling applications, for germline and tumor data analysis. It’s important to note that, besides being fast, none of these tools downsample, so they all process all of the data that is provided as input. Many tools in this field use downsampling to gain speed at the expense of accuracy and consistency.

Let’s take the following example: Using our optimized tool, a 30x WGS from FASTQ to VCF running on a single 36-vCPU server takes less than 4 hours, and on the newer, larger servers (72 vCPUs on AWS, 64 vCPUs on Google) it takes about 2 hours. Users also have the choice to distribute their analyses across multiple machines with hundreds of cores for a turnaround time as low as 15 minutes. Clearly, the compute cost depends on the compute system used. On AWS on-demand instances the cost is around $5.00 USD per WGS, and it’s even less on Google Cloud. On Google Cloud, if you are using preemptible VM (virtual machine) instances (like spot instances on Amazon), the cost can be less than $1.00 USD – for more information see (and run) the Sentieon DNAseq Pipeline on the Google Cloud Platform (GCP). In comparison, Amazon spot instances are slightly more expensive than Google Cloud’s.
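The per-genome cost figures above are simply runtime multiplied by the instance’s hourly rate. A minimal sketch of that arithmetic, using assumed hourly rates for illustration (not official AWS or Google Cloud pricing):

```python
# Illustrative per-genome cost estimate for a 30x WGS run.
# The hourly rates below are assumptions for illustration only,
# not official AWS/Google Cloud pricing.

def wgs_compute_cost(runtime_hours: float, hourly_rate: float) -> float:
    """Cost of a single WGS run on one instance: runtime x hourly rate."""
    return runtime_hours * hourly_rate

# ~2 hours on a large on-demand instance at an assumed ~$2.50/hr
on_demand = wgs_compute_cost(2.0, 2.50)
# ~2 hours on a preemptible/spot instance at an assumed ~$0.50/hr
preemptible = wgs_compute_cost(2.0, 0.50)

print(f"on-demand:   ${on_demand:.2f} per genome")    # → $5.00
print(f"preemptible: ${preemptible:.2f} per genome")  # → $1.00
```

The same function also shows why shorter runtimes compound with cheaper instance types: halving the runtime and switching to preemptible pricing multiplies the savings.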

To emphasize, the Sentieon software is agnostic to the compute environment. Therefore, Sentieon’s products are not tied to specific hardware. The software comes as a binary executable and can be deployed and run within any compute system.

An example of the “GATK Best Practices” pipeline using Sentieon tools.
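As a rough illustration of what such a pipeline looks like, here is a simplified sketch of a germline DNAseq run using Sentieon’s drop-in tools in place of the BWA/GATK stages. This is an assumption-laden sketch, not Sentieon’s official recipe: it assumes the `sentieon` binary is installed and licensed, all file paths are placeholders, and the exact flags should be verified against Sentieon’s current documentation.

```shell
# Simplified germline DNAseq sketch (BWA/GATK-equivalent stages).
# Paths and read-group values are placeholders; verify flags against
# Sentieon's documentation before use.
REF=ref.fa
NT=32   # number of threads

# 1. Alignment: produces the same BAM as BWA-MEM, run faster
sentieon bwa mem -t $NT -R '@RG\tID:rg1\tSM:sample1\tPL:ILLUMINA' \
    $REF reads_1.fq.gz reads_2.fq.gz | \
    sentieon util sort -t $NT --sam2bam -i - -o sorted.bam

# 2. Duplicate marking (Picard MarkDuplicates equivalent)
sentieon driver -t $NT -i sorted.bam \
    --algo LocusCollector --fun score_info score.txt
sentieon driver -t $NT -i sorted.bam \
    --algo Dedup --score_info score.txt deduped.bam

# 3. Base quality score recalibration (GATK BQSR equivalent)
sentieon driver -t $NT -r $REF -i deduped.bam \
    --algo QualCal -k known_sites.vcf.gz recal.table

# 4. Variant calling (GATK HaplotypeCaller equivalent)
sentieon driver -t $NT -r $REF -i deduped.bam -q recal.table \
    --algo Haplotyper output.vcf.gz
```

Each stage is a separate command, which is what lets customers mix and match individual components into their own custom pipelines, as described above.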

EB: Who do you view as your current competition and why? What differentiates Sentieon from other players in the market?

Brendan: Anyone developing alternative software methods to, for example, BWA/GATK, or developing special-hardware-based solutions, can be viewed as a competitor. However, we may also collaborate with them if they are open to and interested in using our core computing engine in their approaches. Our software tool users generally like our DNAscope and TNscope tools, since these have been shown to be more accurate and are customizable to their needs.

In our opinion, the industry is still a long way from achieving accurate individual representations of the genome, so the more methods being developed to improve accuracy, the better. Sentieon’s skills are in solving complex mathematical problems extremely accurately and efficiently, so if any method is proven to be the best, we would consider improving it computationally. We have proven this approach with our results in public challenges like the precisionFDA Consistency Challenge, the ICGC-TCGA DREAM Challenge, and recently the precisionFDA NCI-CPTAC Crowdsourced Multi-omics Sample Mislabeling Big Data Challenge. To date, our software has won more of these awards than any other set of tools.

Everyone in this field is working to improve our understanding of biology and the impact of -omics on human health and the world, so we don’t really view anyone as a traditional competitor. We all need to work together to improve our understanding and bring better resolution and accuracy to genomic data.

EB: What do you see as the biggest challenge(s) the genomics data analysis field is currently facing and why? How can we overcome these challenges? How important is standardization of data analysis?

Brendan: Standardization is very important. In fact, it is one of the biggest challenges the field is currently facing, and Sentieon is very much in favor of standardizing data sets. Standardization enables data sharing without the need to re-run samples, which matters more and more as sample sizes increase. A great example of this is the functionally equivalent pipeline for the Centers for Common Disease Genomics (CCDG) project. They have defined a BWA/GATK-based pipeline so that each sample is processed the same way, and our tools also meet this functionally equivalent standard. As the standardized data are made public, anyone who has processed their data with the functionally equivalent pipeline can compare their data to this cohort without any pipeline bias. Therefore, anyone who adopts the CCDG pipeline for their internal data will be able to take advantage of all that public data.

We have customers who ran over 20,000 human whole-genome samples through the CCDG pipeline. This may seem like a lot, but it is only about 10% of the CCDG dataset, so by choosing their pipeline wisely, they ultimately get to benefit from 10x more data. And since they used our tools, they were also able to shrink their compute time by 10x. This is a huge value add and greatly increases the statistical power of analyzing variant data.

There are many big challenges, but as the data processing portion becomes more routine, Sentieon helps data analysts achieve high accuracy, low cost, and systems that meet their needs for precision data. In my opinion, the technical challenges are solvable as long as people are willing and able to collaborate.

Brigitte Ganter
