Big Data and AI Driving Rare Disease Diagnosis: centogene.com

1. Executive summary

CENTOGENE & Big Data

With big data being a key enabler of any successful artificial intelligence effort, CENTOGENE is ideally suited to employ artificial intelligence (AI) in its diagnostic workflow. We believe that we have the world’s largest curated mutation database for rare diseases, and in particular the biggest database for causal variants. Since the predictive power and accuracy of artificial intelligence rely heavily on how good and how comprehensive the underlying data is, the size of our database combined with our AI expertise is creating a paradigm shift in our diagnostic approach and capability.

The Need for Variant Prioritization

Next generation sequencing (NGS) has revolutionized disease diagnosis and treatment, approaching the point of providing a personal genome sequence for every patient. A typical NGS analysis of a patient reveals tens of thousands of non-reference coding variants, but only one or very few are expected to be significant for the relevant disorder. In a filtering stage, one employs family segregation, population frequency, quality, predicted protein impact and evolutionary conservation as a means of shortening the variant list.

However, narrowing down further towards culprit disease genes usually entails a laborious process of seeking gene-phenotype relationships by consulting numerous separate databases. Thus, a major challenge that variant prioritization addresses is the transition from the few hundred shortlisted genes to the most viable disease-causing candidates.

CENTOGENE Variant Prioritization Tool

We developed a new variant prioritization tool to address the need to identify the most viable disease-causing genes: a solution that leverages CENTOGENE’s unique data repository. By deploying our in-house artificial intelligence capability, we devised a strategy to prioritize candidate sequence variants on the basis of their segregation patterns, prevalence in internal and external population databases, impact and quality, along with deep phenotype similarities. Importantly, the tool was found to have accelerated CENTOGENE’s Whole Exome and Clinical Exome Sequencing diagnostics process.

We were able to demonstrate that, by combining what we believe to be the world’s largest database of genetic information with an AI-based variant prioritization solution, CENTOGENE’s clinical score outperformed the other companies’ tools that we tested.

2. Context

Big Data & Artificial Intelligence in Diagnostics (background)

In the field of clinical diagnostics, big data and artificial intelligence (AI) are transforming the sector like nothing before. The availability of large and diverse biomedical data sets (big data), combined with huge increases in computing power, is enabling previously unheard-of analysis possibilities, resulting in discoveries and diagnostic insights via AI.

Importantly, big data is the key enabler of any artificial intelligence effort. Because AI systems need enormous data sets to train algorithms, the better and more comprehensive the available data, the higher the predictive power and accuracy of artificial intelligence: machines can identify complex patterns and relationships in large amounts of unstructured data derived from a variety of different sources. AI derives insights and makes predictions that are simply not possible by human analysis alone.

Variant Prioritization (background)

In the diagnosis of rare disease patients, variant prioritization is a vital step in identifying causal, disease-causing variants. This is because the results of next-generation sequencing (NGS) technologies and applications, such as Whole Exome Sequencing (WES) or Whole Genome Sequencing (WGS), will often consist of a list of several thousand variants of unknown significance, many of which prove to be benign (even though any rare variant has the potential to be pathogenic).

In simple terms, variant prioritization accelerates and simplifies variant interpretation because its results guide the interpretation of variants of unknown significance. Together with filtering, it identifies which variants found via genetic testing are likely to affect the function of a gene. Prioritization scores support the diagnosis of a patient, and indeed rare disease diagnosis relies heavily on them.

There are many academic and commercial tools available today for filtering out variants that are deemed unlikely to cause disease. These tools prioritize variants on the basis of segregation, minor allele frequency, predicted pathogenicity, text-mining, dbSNP information, and genotype quality. However, used in a standalone fashion, these methods have typically not been able to identify the causative variants underlying a patient’s phenotype. They require additional investigation, such as consulting external databases and identifying shared rare variants in unrelated individuals with similar diseases.
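
The filtering criteria listed above can be illustrated with a minimal Python sketch. The field names (`maf`, `genotype_quality`, `impact`), the thresholds and the sample variants are hypothetical choices made for illustration; they do not describe any specific academic or commercial tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; all field names are illustrative assumptions."""
    gene: str
    maf: float              # minor allele frequency in a population database
    genotype_quality: int   # Phred-scaled genotype quality
    impact: str             # predicted impact: "HIGH", "MODERATE" or "LOW"


def passes_filters(v: Variant, max_maf: float = 0.01, min_gq: int = 20) -> bool:
    """Keep rare, well-supported variants with a non-trivial predicted impact."""
    return (
        v.maf <= max_maf
        and v.genotype_quality >= min_gq
        and v.impact in {"HIGH", "MODERATE"}
    )


def shortlist(variants: List[Variant]) -> List[Variant]:
    """Apply the hard filters and return the surviving variants in input order."""
    return [v for v in variants if passes_filters(v)]


# Made-up example data.
variants = [
    Variant("GENE_A", maf=0.0001, genotype_quality=60, impact="HIGH"),
    Variant("GENE_B", maf=0.2500, genotype_quality=90, impact="MODERATE"),  # too common
    Variant("GENE_C", maf=0.0005, genotype_quality=10, impact="HIGH"),      # low quality
    Variant("GENE_D", maf=0.0020, genotype_quality=45, impact="LOW"),       # low impact
]
print([v.gene for v in shortlist(variants)])
```

Hard filters of this kind shorten the list but do not rank what remains, which is why a separate prioritization step is needed.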

3. Issue

CENTOGENE’s Data Repository

At the core of CENTOGENE’s platform is our data repository CentoMD®, which includes epidemiologic, phenotypic and heterogenetic data, and allows us to assemble an extensive knowledge base in rare hereditary diseases. As of May 31, 2019, our CentoMD® database included curated data from over 420,000 patients. CentoMD® brings rationality to the interpretation of global genetic data, and we believe it is the world’s largest curated mutation database for rare diseases, and in particular the biggest database for variants, with >7.3 million variants and a significant number of unpublished variants.

Importantly, before adding information into CentoMD®, our experts use evidence-based criteria to validate the interpretation of the data. We use a combination of computer-based tools and manual curation by professional scientists with strong backgrounds in human genetics. Our team of scientists collects, annotates and reviews the phenotypic, genetic and epidemiologic data of patient samples to ensure the highest medical validity of each sample. We also employ Human Phenotype Ontology (‘HPO’) coding to accurately track and standardize sample phenotype and genotype data.

Our methodological approach to information curation ensures we provide highly accurate data relevant to clinical diagnoses and decision-making by humans, and indeed for the development of artificial intelligence solutions.

CENTOGENE’s data repository integrates all relevant structured and unstructured patient data, including metabolomic, proteomic and genetic data, health records and clinical information, and especially longitudinal data such as biomarkers or patient recorded outcomes (also called ‘real world data’), as well as diagnostic workflow data. However, the sheer size and continued expansion of the data repository mean that a purely manual, human-based approach to utilizing the data is no longer adequate.

As a result, CENTOGENE is increasingly placing AI at the heart of its diagnostic and pharmaceutical solutions. The application of AI enables us to find statistical relationships faster (speed), draw more exact conclusions about relationships in the data (accuracy), and discover patterns that cannot be found with traditional methods (complexity).

Discovering Variants

CENTOGENE’s Clinical Exome Panel, which includes ~6,600 clinically relevant genes with known associated clinical phenotypes and covers over 3,200 diseases, typically discovers between 70,000 and 150,000 variants per individual. However, the vast majority of variants discovered are benign or unrelated to the observed disease phenotype of the patient. This also applies when we use our 200+ disease-specific panels.

Discovering causal variants in such large unfiltered data sets is an extremely challenging process for any diagnostic expert due to the presence of around 500-800 high impact variants (nonsense, frameshift and splice-site variants) unrelated to the presented disease phenotype. The identification of a single, plausible disease-causing mutation for a particular disease has proven to be extremely difficult.

For CENTOGENE in particular, to aid in the prioritization of candidate disease-causing variants, our genetic experts have traditionally used different sources of information including CentoMD®, deployed various analytical tools, and had our medical experts perform manual analyses separately before pooling the results. CENTOGENE’s diagnostic workflow has therefore required considerable manual work by medical experts in selecting the causative variant out of tens of thousands of variants. This approach is both time-consuming and demands a considerable understanding of each of the tools that are used in the analysis.

Industry methods of interpretation have faced considerable challenges in integrating all available information into one simple scoring schema that would allow easy ranking of variants for prioritization. The power of a new approach to prioritization by CENTOGENE would necessarily come from two sources: a) the integration of patient, gene, and variant scores into one score and b) the integration of in-house knowledge derived from big data analysis into a second score.

Issue at Hand

Given the sheer size of CENTOGENE’s in-house data sets combined with the ever-present need to accelerate the diagnosis of rare disease patients, CENTOGENE identified a requirement for a state-of-the-art artificial intelligence solution for variant prioritization, and aims to shift away from the traditional manual approach typically used.

Requirements for such an in-house prioritization solution are that it should:

Leverage the knowledge and insights obtained from CENTOGENE’s data repository of more than 420,000 analyzed individuals, comprising over 1.6 billion unweighted data points.
Be based on a well-defined set of HPO terms.
Be able to demonstrate its efficacy by assessing CENTOGENE’s clinical score performance against other commercially-available variant prioritization technologies.

4. Approach

To overcome the difficulties in CENTOGENE’s workflow that required considerable manual work for selecting the causative variant out of tens of thousands of variants, we developed a state-of-the-art AI solution for variant prioritization to improve CENTOGENE’s efficiency in the diagnostics workflow, and then benchmarked it against externally available variant prioritization tools.

Our approach is summarized as follows:

a. Prioritize the pathogenic variants by filtering by HPO similarity, in-house database counts, minor allele frequency, mode of inheritance, variant quality, and successive assessment of a variant based on its potential to affect protein integrity and function.

b. Develop a custom clinical score algorithm:

i. For calculating a clinical score that makes use of the phenotype of the patient/family, as described by a set of HPO terms, as well as in-house data and public databases.

ii. That is structured into four parts: HPO gene score, HPO patient score, clinical score and clinical rank.*

c. Benchmark against external variant prioritization tools.

*While the HPO gene score captures the overlap of the patient’s phenotype with each variant-containing gene, the HPO patient score focuses on CENTOGENE’s existing in-house data to learn about the potential of each variant. The clinical score unifies both of them into one easily interpretable number. Finally, the clinical rank combines all our experience in manual hypothesis-based filtering of variants with the clinical score and provides an overall ranking according to the clinical relevance of each variant. Therefore, the top-ranked variants were used for the basic mode filter in the newly-developed tool.
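
One way to picture how two component scores could be unified into one number and then turned into a ranking is sketched below in Python. The weighted average, the weight of 0.5 and all names are assumptions made purely for illustration; the actual formula is not given here. The rank in this sketch is a plain sort by score, whereas the clinical rank described above also folds in experience from manual hypothesis-based filtering, which is not modeled.

```python
from typing import Dict, List, Tuple


def clinical_score(hpo_gene: float, hpo_patient: float, w_gene: float = 0.5) -> float:
    """Combine two scores in [0, 1] into one number in [0, 1].

    Assumption: a simple weighted mean stands in for the undisclosed formula.
    """
    if not (0.0 <= hpo_gene <= 1.0 and 0.0 <= hpo_patient <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return w_gene * hpo_gene + (1.0 - w_gene) * hpo_patient


def clinical_rank(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Rank variants by descending score; ties are broken by variant id."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(i, vid, s) for i, (vid, s) in enumerate(ordered, start=1)]


# Made-up component scores: variant id -> (HPO gene score, HPO patient score)
components = {"var1": (0.9, 0.7), "var2": (0.2, 0.4), "var3": (0.6, 0.8)}
scores = {vid: clinical_score(g, p) for vid, (g, p) in components.items()}
ranking = clinical_rank(scores)
for position, variant_id, score in ranking:
    print(position, variant_id, round(score, 2))
```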

A New Tool

Driving our approach was a focus on combining expert knowledge and machine learning (ensemble learning) to implement a comprehensive version of variant prioritization that would consider all 23 published variant predictors available. The new tool was built by running an iterative ensemble procedure on a training set that consisted of large numbers of in-house variants and variants from public databases.

This approach also enabled the development of a tool that is now uniquely available to CENTOGENE’s diagnostic scientists. For each variant, the tool compared HPO terms provided for a new patient with the HPO terms from other patients contained in CENTOGENE’s data repository with variants in the same gene. Expressed as the “HPO patient score,” this type of information has proven critical for reaching a rapid diagnosis, with patients with very rare or as yet undescribed diseases benefitting the most.
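
The comparison of HPO term sets described above can be sketched as follows. Jaccard overlap is used as a stand-in similarity measure, and taking the best match per gene is likewise an assumption; the measure actually used is not stated here. The repository layout and function names are hypothetical, and the HPO identifiers serve only as opaque labels.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two term sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hpo_patient_score(new_terms: FrozenSet[str], prior: List[FrozenSet[str]]) -> float:
    """Best similarity between the new patient and any earlier patient
    who carries a variant in the same gene (0.0 if there is none)."""
    return max((jaccard(new_terms, p) for p in prior), default=0.0)


# Hypothetical repository: gene -> HPO term sets of earlier patients
repository: Dict[str, List[FrozenSet[str]]] = {
    "GENE_A": [frozenset({"HP:0001250", "HP:0001263"}), frozenset({"HP:0000252"})],
    "GENE_B": [frozenset({"HP:0000252", "HP:0001252"})],
}
new_patient = frozenset({"HP:0001250", "HP:0001263", "HP:0000252"})

patient_scores = {
    gene: hpo_patient_score(new_patient, prior) for gene, prior in repository.items()
}
print(patient_scores)
```

A set-overlap measure treats all terms as equally informative; measures that exploit the ontology hierarchy would weight specific terms more heavily, at the cost of more machinery.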

The approach enabled the design, implementation and performance assessment of the “clinical score”, which is the key element of variant prioritization. The validation comprises a comparison to external tools as well as extensive performance assessment on large in-house data sets contained within CentoMD® with previously reported variants.

Underpinning the approach was the deployment of AI; scores were developed using random forest, bagging and boosting strategies as well as a custom-developed decision tree approach for integrating phenotypic, genetic, and variant-centric scores.
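
As a rough illustration of the bagging idea only, the Python sketch below aggregates simple threshold classifiers ("decision stumps") fitted on bootstrap resamples of made-up predictor scores. It is a toy stand-in and not the model described here: the two features, the training data and all function names are hypothetical, and random forests and boosting are not shown.

```python
import random
from typing import List, Sequence, Tuple

# (feature index j, threshold t): vote "pathogenic" if x[j] >= t
Stump = Tuple[int, float]


def fit_stump(X: Sequence[Sequence[float]], y: Sequence[int]) -> Stump:
    """Exhaustively pick the single-feature threshold with the fewest training errors."""
    best, best_err = (0, 0.0), len(y) + 1
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            err = sum((row[j] >= t) != bool(label) for row, label in zip(X, y))
            if err < best_err:
                best, best_err = (j, t), err
    return best


def fit_bagged_stumps(
    X: Sequence[Sequence[float]], y: Sequence[int], n_models: int = 25, seed: int = 0
) -> List[Stump]:
    """Bagging: fit one stump per bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_score(models: List[Stump], x: Sequence[float]) -> float:
    """Fraction of stumps voting 'pathogenic'; a score in [0, 1]."""
    return sum(x[j] >= t for j, t in models) / len(models)


# Made-up training data: two predictor scores per variant, label 1 = pathogenic.
X = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7], [0.1, 0.2], [0.2, 0.1], [0.15, 0.3]]
y = [1, 1, 1, 0, 0, 0]
models = fit_bagged_stumps(X, y)
print(predict_score(models, [0.95, 0.95]), predict_score(models, [0.05, 0.05]))
```

Averaging many weak models fitted on resampled data reduces the variance of any single model, which is the motivation behind bagging-style ensembles.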

Benchmarking Approach

In order to assess the performance of CENTOGENE’s variant prioritization tool compared with other tools on the market, we selected a number of WES data sets, ranked the variants with each prioritization tool, and compared the results with those obtained using CENTOGENE’s own algorithm.

Specifically, the data sets we used for this assessment were as follows:

Validation dataset I: The 50 WES cases tested in this validation study contain 40 trio samples (index (affected) : mother (unaffected) : father (unaffected)) and 10 solo cases (affected index only). The causal variants for these samples were previously identified through manual filtering of the data and were classified as either ‘Likely pathogenic’ or ‘Pathogenic’.
Validation dataset II: The 2608 WES cases in validation dataset II contain 2685 pathogenic/likely pathogenic variants. Samples from both male and female patients were included, of which some were part of consanguineous families. The causal variants for these samples were previously identified through manual filtering of the data and were classified as either ‘Likely pathogenic’ or ‘Pathogenic’.
Validation dataset III: The 952 WES cases in validation dataset III contain 1021 variants of uncertain significance (VUS). Samples from both male and female patients were included, of which some were part of consanguineous families. The causal variants for these samples were previously identified through manual filtering of the data and were classified as ‘VUS’.
Validation dataset IV: The 345 clinical exome panel cases in validation dataset IV contain 345 pathogenic/likely pathogenic variants. Samples from both male and female patients were included, of which some were part of consanguineous families. The causal variants for these samples were previously identified through manual filtering.
Validation dataset V: The 262 clinical exome panel cases in validation dataset V contain 262 VUS. Samples from both male and female patients were included, of which some were part of consanguineous families. The causal variants for these samples were previously identified through manual filtering of the data and were classified as ‘VUS’.

The main metric used in the benchmark is the rank provided by our method for previously reported variants. The ranks are then grouped into categories such as rank1, top5, top10, top25, top50, top100, top1,000 and above top1,000.
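
The grouping of ranks into these categories can be sketched in Python as follows. The category labels follow the text; treating each category as the tightest bucket containing the rank, representing unranked variants as `None`, and the example ranks are assumptions for illustration. A cumulative "within the top k" rate is computed separately.

```python
from collections import Counter
from typing import List, Optional

# Category boundaries as listed in the text: (upper bound on rank, label).
BUCKETS = [(1, "rank1"), (5, "top5"), (10, "top10"), (25, "top25"),
           (50, "top50"), (100, "top100"), (1000, "top1,000")]


def rank_category(rank: int) -> str:
    """Return the tightest category containing a 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    for upper, label in BUCKETS:
        if rank <= upper:
            return label
    return "above top1,000"


def cumulative_hit_rate(ranks: List[Optional[int]], k: int) -> float:
    """Share of all cases (ranked or not) whose causal variant is within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Made-up ranks of the causal variant in six cases; None = variant not ranked.
ranks = [1, 3, 7, 30, 1200, None]
categories = Counter(rank_category(r) for r in ranks if r is not None)
print(dict(categories))
print(cumulative_hit_rate(ranks, 10))
```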

5. Results

CENTOGENE’s in-house solution for variant prioritization was seen to outperform external tools owing to our extensive database CentoMD®, efficient filtering strategies, deep-phenotypic methods, and artificial intelligence expertise. Importantly, the resulting solution outperformed the other tools with regard to sensitivity and specificity for flagging ‘pathogenic’ and ‘likely pathogenic’ variants.

Causal Variants Ranking

Company 1: Company 1 ranked the causal variant as top candidate in 11 of 50 cases (22%) and within the top 25 in 37 cases (74%). For 3 cases (6%), the causal variant was not ranked, either because Company 1 produced no score or because of a wrong transcript annotation.

Company 2: In comparison, Company 2 ranked the causal variant as top candidate in 27 cases (54%), within the top 5 in 33 cases (66%), within the top 10 in 39 cases (78%) and within the top 1000 in 42 cases (84%). The tool did not rank the causal variant for 8 cases (16%) due to insufficient phenotype overlap.

Company 3: We also ran Company 3 on our validation dataset of 50 cases; it ranked the causal variant as top candidate in 17 cases (34%), within the top 5 in 27 cases (54%), within the top 10 in 35 cases (70%) and within the top 1000 in 48 cases (96%). For the remaining 2 cases (4%), the causal variant was not ranked because Company 3 produced no score.

CENTOGENE: Our tool ranked the causal variant within the top 25 variants in 46 of the 50 cases (92%) and within the top 1000 variants in all cases.

Benchmarking analyses with 50 WES cases

6. Conclusion

With the exponential growth of biomedical data sets in genetic diagnostics, it is increasingly important to combine large data sets with artificial intelligence in order to accelerate patient diagnosis. Variant prioritization is important in patient diagnosis because it helps identify likely variants from the results of next-generation sequencing technologies. With a number of academic and commercial tools available for variant prioritization, CENTOGENE developed an in-house AI-based tool, and benchmarked it against similar external tools used by its peer companies. The results demonstrated the superior performance of CENTOGENE’s tool against others tested.

To summarize, our variant prioritization solution has several strengths:

It addresses a central challenge in this process by deploying artificial intelligence to perform fast, comprehensive and highly accurate prioritization.
The method incorporates a large number of affected and unaffected patients from CENTOGENE’s extensive in-house database, and this information helps to promote relevant variants and penalize irrelevant ones.
The results demonstrate that CENTOGENE’s state-of-the-art solution - driven by our in-house AI systems - outperforms external state-of-the-art tools for prioritizing specific variants that could be linked to a specific disease.