As the cost of sequencing declines and instruments and datasets become increasingly accessible, a highly automated data analysis pipeline is critical for the transition from technology adoption to accelerated diagnostics.


  1. Bioinformatics

In-house bioinformatics pipeline

CENTOGENE’s Illumina Bioinformatics pipeline for the analysis of WES data is based on the 1000 Genomes Project (1000G) data analysis pipeline and GATK best-practice recommendations. It is composed of the widely used open-source software projects bwa 0.7.10 [1], Picard-tools 1.138 and GATK 3.2 [2, 3], snpEff 4.1 [4], BEDTools 2.25.0 [5], bbduk v35.x, samtools 0.1.19 [6] and freebayes 0.9.20, together with custom-developed software (Figure 1).

Figure 1. Workflow diagram of CENTOGENE’s Illumina Bioinformatics pipeline

How it works

First, raw sequencing reads are converted into standard FASTQ format. The quality of the sequencing reads is then assessed to detect any deviation from expected quality that would prevent further analysis of the sequencing data. Short reads are then aligned to the currently validated build of the human reference genome using BWA with the MEM algorithm.
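The read-quality check described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual QC step: the function names, the Phred+33 encoding and the mean-quality threshold of 20 are assumptions chosen for illustration only.

```python
import io
from statistics import mean


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@' from the read ID


def mean_read_quality(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 (Sanger/Illumina 1.8+) encoding."""
    return mean(ord(char) - offset for char in quality_string)


def low_quality_reads(handle, min_mean_q=20):
    """Return IDs of reads whose mean Phred score is below min_mean_q (hypothetical cut-off)."""
    return [
        read_id
        for read_id, _, qual in parse_fastq(handle)
        if mean_read_quality(qual) < min_mean_q
    ]


# Two toy reads: 'I' encodes Phred 40, '!' encodes Phred 0.
EXAMPLE_FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
flagged = low_quality_reads(io.StringIO(EXAMPLE_FASTQ))
```

In this toy input only `read2` falls below the threshold. A production QC step would typically also look at per-cycle quality, adapter content and duplication levels rather than a single per-read mean.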

The alignments are converted to the binary BAM file format, sorted on the fly and deduplicated without intermediate input/output operations on temporary files, to achieve maximal performance. The primary alignment files for each sample are further refined and augmented with additional information following GATK best-practice recommendations. Base Quality Score Recalibration (BQSR) is applied to improve the accuracy of per-base quality scores, so that they converge more closely to the actual probability of a mismatch with the reference genome.
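The idea behind recalibration can be sketched with the Phred scale, where a quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below compares a reported score with an empirical score derived from observed mismatches. It is a simplified illustration, not GATK's implementation: GATK groups bases by covariates such as read group, reported quality, machine cycle and sequence context, and the add-one smoothing and function names used here are assumptions.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the error probability it represents."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def empirical_quality(mismatches, observations):
    """Empirical Phred score for one group of bases.

    Uses add-one smoothing (an assumption of this sketch) so that groups
    with zero observed mismatches do not yield an infinite score.
    """
    p_error = (mismatches + 1) / (observations + 2)
    return error_prob_to_phred(p_error)


# Toy example: bases reported as Q30 (expected error rate 0.001) that
# mismatch the reference 9 times in 998 observations at non-variant sites.
reported_q = 30
observed_q = empirical_quality(mismatches=9, observations=998)
```

Here the observed error rate is about 1 in 100, so the empirical score is close to Q20 and the reported Q30 would be adjusted downwards.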

To minimize false-positive variant calls in genomic regions containing insertions and deletions (InDels), local realignment of reads around InDels is performed. Afterwards, variant calling is performed on the secondary alignment files using three different variant callers. The GATK HaplotypeCaller is applied with standard parameters and, where suitable, limited to the target regions of capture- or amplicon-based assays. In addition to the GATK HaplotypeCaller, two other commonly used variant callers, freebayes and samtools, are applied.
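The text does not state how the three call sets are combined. One common approach, shown here purely as an assumption, is to record which callers support each variant and then filter by the number of supporting callers. The variant key (chromosome, position, reference allele, alternate allele), the function names and the toy calls are all hypothetical.

```python
from collections import defaultdict


def tally_caller_support(calls_by_caller):
    """Map each variant to the set of callers that reported it.

    calls_by_caller: dict of caller name -> iterable of
    (chrom, pos, ref, alt) tuples.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    return dict(support)


def variants_with_min_support(support, min_callers):
    """Return the sorted variants reported by at least min_callers callers."""
    return sorted(
        variant
        for variant, callers in support.items()
        if len(callers) >= min_callers
    )


# Toy call sets for illustration only (not real data).
calls = {
    "HaplotypeCaller": [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")],
    "freebayes": [("chr1", 1000, "A", "G"), ("chr3", 42, "G", "GA")],
    "samtools": [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")],
}
support = tally_caller_support(calls)
union_calls = variants_with_min_support(support, min_callers=1)
consensus_calls = variants_with_min_support(support, min_callers=3)
```

Taking the union (`min_callers=1`) favours sensitivity, while requiring agreement between callers favours specificity. In practice, variant representations would need to be normalized (for example, left-aligned InDels) before they are compared across callers.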

Variants are annotated using Annovar [7] and in-house ad hoc bioinformatics tools. Alignments are visually verified with the Integrative Genomics Viewer v.2.3 [8] and Alamut v.2.4.5 (Interactive Biosoftware, Rouen, France).


