Break Dance in the Genomes

Development of the Computational Tool BreakDancer

Jul. 26, 2010
Fig. 1: Overview of BreakDancer SV detection and confirmation pipeline. Five types of SVs predicted ...
Fig. 2: Size distribution of deletions detected in an AML genome. 3,170 deletions were detected from ...
Ken Chen, PhD, Senior Scientist, The Genome Center, Washington University in St. Louis, USA

Genomic structural variation such as deletion, inversion, and translocation can cause severe genetic disorders such as cancer and Alzheimer's disease. To study the genetics of these diseases, we developed a computational tool, BreakDancer, that detects a wide variety of structural variants (SVs) in the human genome. We applied BreakDancer to our cancer genome projects and to the 1,000 Genomes Project. We found that the tool substantially improved the state of the art of SV detection, and that it is especially effective at detecting both small and large indels from 10 bp to 1 Mbp.

Introduction

Genomic structural variation (SV) is commonly considered to be any DNA sequence alteration other than a single nucleotide substitution [1]. SVs in germ cells contribute to various heritable genetic diseases, and SVs in somatic cells contribute to cancers [2]. Numerous types of SVs exist in humans, including deletion, insertion, duplication, inversion, and translocation. Many SVs have previously been discovered using array comparative genomic hybridization (CGH) and fosmid end sequence mapping. However, these technologies have limited resolution and are particularly poor at detecting small variants from 10 bp to 5 kb.
Recent advances in massively parallel next-generation sequencing have offered an opportunity to revolutionize the discovery of SVs. One widely used instrument, the Genome Analyzer (GA) II (Illumina), sequences the ends of short DNA fragments between 100 and 500 bp and requires little input DNA (~1 µg) to achieve 20- to 40-fold sequence coverage. Many novel SVs have been discovered in recent whole genome resequencing projects using this platform [3].
The detection algorithms applied in these projects, however, have been largely ad hoc. Many computational issues regarding the analysis of paired-end mapping and the utility of the short fragments remain unresolved. Open questions include whether the heuristics and parameters established for long fragments can be extrapolated to short fragments, how false positive and negative rates vary with respect to coverage, fragment size, read length, and mapping accuracy, and how prediction confidence should be estimated.

As whole genome next generation sequencing (WGNGS) begins to dominate genome resequencing projects, there is a pressing need both to advance the analysis algorithms and to provide practical tools for data analysis.

Methods

We developed an SV discovery pipeline that conducts de novo prediction and in silico confirmation from WGNGS paired-end data. The de novo prediction program, BreakDancer [4], consists of two complementary algorithms that examine the alignments produced by a next-generation short-read aligner such as MAQ [5] (fig. 1). The first algorithm, BreakDancerMax, identifies anomalously mapped fragments (ARFs) whose ends map at unexpected distances or in unexpected orientations. It searches for genomic regions that anchor significantly more ARFs than expected by chance and derives putative SVs from the identification of one or more regions that are interconnected by at least two ARFs. It then estimates a confidence score using a Poisson model that takes into consideration the number of supporting ARFs, the size of the anchoring regions, and the coverage of the genome. Finally, it outputs five types of SVs: large deletion (>100 bp), insertion, inversion, intra-chromosomal translocation, and inter-chromosomal translocation.
To utilize the high redundancy in the WGNGS data, we developed the second algorithm, BreakDancerMini, which predicts small indels (10 bp to 100 bp) by examining the normally mapped fragments (NRFs) that are ignored by BreakDancerMax. NRFs are not informative individually. In groups, however, they often confer sufficient statistical power to distinguish sequence-altered regions from unaltered ones. BreakDancerMini identifies such regions using a Kolmogorov-Smirnov test and derives SVs using procedures similar to those of BreakDancerMax. Both the Max and Mini algorithms can be applied to a pool of DNA samples to identify the common and the novel variants.
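The idea behind ARF detection and Poisson scoring can be illustrated with a minimal sketch. This is not BreakDancer's own code: the names (`ReadPair`, `classify_pair`, `expected_arfs`, `poisson_tail`), the three-standard-deviation cutoff, the forward/reverse orientation assumed for a normal pair, and the way the expected ARF count is derived are all assumptions made for illustration. The score here is the probability, under a Poisson model, of seeing at least the observed number of ARFs in an anchoring region by chance.

```python
import math
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a pair 'normal' or with an illustrative anomaly type.

    Assumes pos1 <= pos2 on the same chromosome and that a normal pair
    maps as '+' then '-' (an assumption about the library, not a rule
    taken from the article).
    """
    if pair.chrom1 != pair.chrom2:
        return "inter_chromosomal"
    if pair.strand1 == pair.strand2:
        return "inversion"
    if (pair.strand1, pair.strand2) != ("+", "-"):
        return "intra_chromosomal"
    span = pair.pos2 - pair.pos1
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"   # ends map farther apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"  # ends map closer together than expected
    return "normal"


def expected_arfs(n_arfs_genome, region_size, genome_size):
    """Expected ARFs in a region if ARFs were spread uniformly (assumption)."""
    return n_arfs_genome * region_size / genome_size


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    cdf = 0.0
    term = math.exp(-lam)  # P(X = 0)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)
```

A region of 600 bp supported by four ARFs, in a genome of 3 Gbp with 50,000 ARFs of that type, would be scored as `poisson_tail(4, expected_arfs(50_000, 600, 3_000_000_000))`; smaller values indicate that the cluster is less likely to have arisen by chance.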
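To make the statistical idea concrete, here is a minimal sketch of a two-sample Kolmogorov-Smirnov comparison between the fragment sizes observed in one genomic window and a genome-wide reference sample. It is an illustration under stated assumptions, not the BreakDancerMini implementation: the function names, the use of the plain asymptotic p-value, and the small-value guard are choices made here.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare CDFs.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution series."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.3:
        # The alternating series converges poorly near zero; the true
        # p-value is essentially 1 in this range.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))
```

In use, `sample_a` would hold the mapped spans of the NRFs overlapping a candidate window and `sample_b` a genome-wide sample of spans; a small p-value suggests that the window's fragment-size distribution has shifted, as would be expected over a small indel.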

We further confirm the putative SVs by assembling all read pairs that have at least one end mapped to the predicted intervals. We find that our assembler Tigra (Chen et al., unpublished) can achieve a confirmation rate as high as 93%. This in silico confirmation process returns the exact locations of the SVs and the nucleotide sequences that span the SV breakpoints.
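The read selection step that precedes assembly can be sketched as a simple filter. The data layout (tuples of chromosome and 1-based start position for each end), the fixed read length, and the function names are assumptions for illustration; the assembly itself is not shown.

```python
def end_in_interval(chrom, pos, read_len, interval):
    """True if a read starting at pos overlaps (chrom, start, end)."""
    i_chrom, i_start, i_end = interval
    return chrom == i_chrom and pos <= i_end and pos + read_len - 1 >= i_start


def select_pairs(pairs, interval, read_len):
    """Keep pairs with at least one end overlapping the interval.

    Each pair is a tuple (chrom1, pos1, chrom2, pos2).
    """
    return [
        p for p in pairs
        if end_in_interval(p[0], p[1], read_len, interval)
        or end_in_interval(p[2], p[3], read_len, interval)
    ]
```

Keeping pairs in which only one end maps to the interval matters because the other end may be unmapped or mapped elsewhere precisely when it spans a breakpoint.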

Results

We examined BreakDancer's performance using paired-end data from the first WGNGS African genome and obtained a non-redundant set of 27,092 deletions, 19,305 insertions, and 665 inversions. BreakDancer identified a similar number of large deletions but 15 to 20 times more small indels than two other tools [6, 7], judged on previously validated results [8].
We performed SV detection on the genomes of an individual with cytogenetically normal acute myeloid leukemia (AML) [9]. We jointly analyzed 42x short-read data from both the tumor and the normal samples and obtained 7,087 putative variants. The size distribution of the 3,170 deletions contained two spikes at 300 bp and 6 kb, produced respectively by the AluY and L1Hs elements (fig. 2). After removing germline variants that were detected in both the tumor and the normal samples, we derived a set of 223 putative somatic variants that include 100 deletions, 67 insertions (< 100 bp), 22 inversions, and 34 intra-chromosomal translocations. Our assembly approach confirmed 100 (60%) of the 167 indels. We further submitted the entire set of 167 indels for PCR resequencing. Of these, 110 (69 deletions and 41 insertions) were validated both in the tumor and in the normal, indicating a 78% validation rate.
We applied BreakDancer to the 1,000 Genomes Project [10] and discovered and confirmed thousands of common SVs ranging from 50 bp to 1 Mbp at base pair resolution.
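The germline filtering step in the tumor/normal analysis can be sketched as follows. The record layout (dictionaries with `tumor_reads` and `normal_reads` counts of supporting read pairs) and the thresholds are hypothetical; the article states only that variants detected in both samples were removed as germline.

```python
def partition_variants(variants, tumor_min=2, normal_max=0):
    """Split jointly called variants into putative somatic and germline sets.

    A variant is treated as detected in the tumor when it has at least
    tumor_min supporting read pairs there, and as germline when it also
    has more than normal_max supporting read pairs in the normal sample.
    Both thresholds are illustrative assumptions.
    """
    somatic, germline = [], []
    for v in variants:
        if v["tumor_reads"] < tumor_min:
            continue  # not detected in the tumor; ignored in this sketch
        if v["normal_reads"] > normal_max:
            germline.append(v)
        else:
            somatic.append(v)
    return somatic, germline
```

Analyzing both samples jointly, rather than calling each separately and intersecting the results, means that even weak support in the normal sample can flag a variant as germline.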

Washington University School of Medicine
4444 Forest Park Avenue
St. Louis, MO 63108
USA

Web: http://genome.wustl.edu/people/chen_ken