In-house bioinformatics pipeline
CENTOGENE’s Illumina bioinformatics pipeline for the analysis of whole-exome sequencing (WES) data is based on the 1000 Genomes Project (1000G) data analysis pipeline and GATK best-practice recommendations. It is composed of widely used open-source software, namely bwa 0.7.10 [1], Picard-tools 1.138 and GATK 3.2 [2, 3], snpEff 4.1 [4], BEDTools 2.25.0 [5], bbduk v35.x, samtools 0.1.19 [6] and freebayes 0.9.20, together with custom-developed software (Figure 1).
How it works
First, raw sequencing reads are converted into standard FASTQ format. The quality of the reads is then assessed to detect any deviation from the expected quality that would prevent further analysis of the sequencing data. Next, the short reads are aligned to the currently validated build of the human reference genome using BWA with the MEM algorithm.
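As an illustration of the read-quality check described above, the following minimal Python sketch parses FASTQ records and flags reads whose mean Phred score is low. It assumes the Phred+33 encoding; the function names and the threshold of 20 are hypothetical choices for this example and are not taken from the pipeline itself.

```python
from statistics import mean

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line carries no information here
        qual = next(it).strip()
        yield header[1:], seq, qual  # drop the leading '@'


def mean_quality(qual):
    """Mean Phred score of one read's quality string."""
    return mean(ord(c) - PHRED_OFFSET for c in qual)


def flag_low_quality(lines, min_mean_q=20.0):
    """Return ids of reads whose mean quality is below a hypothetical threshold."""
    return [rid for rid, _seq, qual in parse_fastq(lines)
            if mean_quality(qual) < min_mean_q]


example = [
    "@read1", "ACGT", "+", "IIII",   # 'I' encodes Phred 40
    "@read2", "ACGT", "+", "!!!!",   # '!' encodes Phred 0
]
print(flag_low_quality(example))  # prints ['read2']
```

A production check would typically also look at per-cycle quality, adapter content and read-length distribution; this sketch covers only the mean per-read score.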
The alignments are converted to the binary BAM file format, sorted on the fly and deduplicated without intermediate input/output operations to temporary files, to achieve maximal performance. The primary alignment files for each sample are further refined and augmented with additional information following GATK best-practice recommendations. Base Quality Score Recalibration (BQSR) is applied to improve the accuracy of per-base quality scores, so that they better reflect the actual probability of a mismatch with the reference genome.
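The idea behind recalibration can be sketched in a few lines of Python. A Phred score Q corresponds to an error probability of 10^(-Q/10), and an empirical score can be estimated from how often bases with a given reported score actually mismatch the reference (at sites not known to be polymorphic). The code below is a simplified illustration and not GATK's actual model; the tallies and the +1/+2 smoothing are assumptions made for this example.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


def empirical_quality(mismatches, observations):
    """Phred-scaled observed mismatch rate, smoothed to avoid log10(0)."""
    return error_prob_to_phred((mismatches + 1) / (observations + 2))


# Hypothetical tallies: reported quality -> (mismatches, observations)
tallies = {30: (99, 9998), 40: (9, 9998)}
for reported, (mm, obs) in tallies.items():
    print(f"reported Q{reported}: empirical Q{empirical_quality(mm, obs):.1f}")
```

In this made-up example, bases reported as Q30 mismatch about once per 100 observations, so their empirical score is close to Q20, which is the kind of overconfidence that recalibration is meant to correct.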
To minimize false-positive variant calls in genomic regions containing insertions and deletions (InDels), local realignment of reads around InDels is performed. Variant calling is then performed on the resulting secondary alignment files using three different variant callers. The GATK HaplotypeCaller is applied with standard parameters and, where suitable, is limited to the target regions of capture- or amplicon-based assays. In addition to the GATK HaplotypeCaller, two other commonly used variant callers, freebayes and samtools, are applied.
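The text does not state how the output of the three callers is combined, so the following Python sketch is only one plausible approach: it tallies which callers support each variant and optionally checks whether a variant lies in a target region. All names, the example calls and the two-caller threshold are hypothetical; target intervals are assumed to be 0-based and half-open (as in BED files), and variant positions 1-based (as in VCF files).

```python
from collections import defaultdict


def tally_support(calls_by_caller):
    """Map each variant (chrom, pos, ref, alt) to the set of callers reporting it."""
    support = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for call in calls:
            support[call].add(caller)
    return support


def concordant(support, min_callers=2):
    """Variants reported by at least min_callers callers, in sorted order."""
    return sorted(v for v, callers in support.items()
                  if len(callers) >= min_callers)


def in_targets(chrom, pos, targets):
    """True if a 1-based position falls in any 0-based half-open target interval."""
    return any(start <= pos - 1 < end for start, end in targets.get(chrom, []))


# Hypothetical calls from the three callers
calls = {
    "haplotypecaller": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
    "freebayes": {("chr1", 1000, "A", "G")},
    "samtools": {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "GA")},
}
targets = {"chr1": [(900, 1100)]}  # hypothetical capture region

support = tally_support(calls)
for chrom, pos, ref, alt in concordant(support):
    print(chrom, pos, ref, alt, in_targets(chrom, pos, targets))
```

Keeping the per-variant set of supporting callers, rather than only a count, makes it possible to review calls made by a single caller separately instead of discarding them.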
Variants are annotated using Annovar [7] and in-house ad hoc bioinformatics tools. Alignments are visually verified with the Integrative Genomics Viewer v.2.3 [8] and Alamut v.2.4.5 (Interactive Biosoftware, Rouen, France).
- Li H, and Durbin R (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760.
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491-498.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297-1303.
- Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, and Ruden DM (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80-92.
- Quinlan AR, and Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, and Durbin R (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079.
- Wang K, Li M, and Hakonarson H (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164.
- Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, and Mesirov JP (2011). Integrative genomics viewer. Nat Biotechnol 29, 24-26.