Break Dance in the Genomes
Development of the Computational Tool BreakDancer
- Fig. 1: Overview of BreakDancer SV detection and confirmation pipeline. Five types of SVs predicted by BreakDancerMax: deletion, insertion, inversion, intra-chromosomal translocation, and inter-chromosomal translocation. A pair of arrows represents the location and the orientation of a read pair. A dotted line represents a chromosome in the subject genome. A solid line represents a chromosome in the reference genome.
- Fig. 2: Size distribution of deletions detected in an AML genome. BreakDancerMax detected 3,170 deletions in the sequence data, ranging from 58 bp to 959,498 bp. Two signature peaks at 300 bp and 6,000 bp correspond respectively to the AluY and L1Hs retrotransposons. In comparison, only 116 inherited CNVs were detected in this sample using the Affymetrix 6.0 array.
- Ken Chen, PhD, Senior Scientist, The Genome Center, Washington University in St. Louis, USA
Genomic structural variation such as deletion, inversion, and translocation can cause severe genetic disorders such as cancer and Alzheimer's disease. To study the genetics of these diseases, we developed a computational tool, BreakDancer, that detects a wide variety of structural variants (SVs) in the human genome. We applied BreakDancer to our cancer genome projects and to the 1,000 Genomes Project. We found that the tool substantially improved on the state of the art in SV detection, and that it is especially effective at detecting both small and large indels from 10 bp to 1 Mbp.
Genomic structural variation (SV) is commonly considered to be any DNA sequence alteration other than a single nucleotide substitution [1]. Instances of SVs in germ and somatic cells contribute respectively to various heritable genetic diseases and cancers [2]. Numerous types of SVs exist in humans, including deletion, insertion, duplication, inversion, and translocation. Many SVs have previously been discovered using array comparative genomic hybridization (CGH) and fosmid end sequence mapping. However, these technologies have limited resolution and are particularly poor at detecting small variants from 10 bp to 5 kb.
Recent advances in massively parallel next-generation sequencing offer an opportunity to revolutionize the discovery of SVs. One widely used instrument, the Genome Analyzer (GA) II (Illumina), sequences the ends of short DNA fragments between 100 and 500 bp and requires little input DNA (~1 µg) to achieve 20- to 40-fold sequence coverage. Many novel SVs have recently been discovered in whole-genome resequencing projects using this platform [3].
The detection algorithms applied in these projects, however, have been largely ad hoc. Many computational issues regarding the analysis of paired-end mapping and the utility of the short fragments remain unresolved. Open questions include whether the heuristics and parameters established for long fragments can be extrapolated to short fragments, how false positive and negative rates vary with respect to coverage, fragment size, read length, and mapping accuracy, and how prediction confidence should be estimated.
As whole genome next generation sequencing (WGNGS) begins to dominate genome resequencing projects, there is a pressing need both to advance the analysis algorithms and to provide practical tools for data analysis.
We developed an SV discovery pipeline that conducts de novo prediction and in silico confirmation from WGNGS paired-end data. The de novo prediction program, BreakDancer [4], consists of two complementary algorithms that examine the alignments produced by a next-generation short-read aligner such as MAQ [5] (fig. 1). The first algorithm, BreakDancerMax, identifies anomalously mapped fragments (ARFs) whose ends are mapped at unexpected distances or in unexpected orientations. It searches for genomic regions that anchor significantly more ARFs than expected by chance and derives putative SVs from one or more such regions that are interconnected by at least two ARFs. It then estimates a confidence score using a Poisson model that takes into account the number of supporting ARFs, the size of the anchoring regions, and the coverage of the genome. Finally, it outputs five types of SVs: large deletion (>100 bp), insertion, inversion, intra-chromosomal translocation, and inter-chromosomal translocation.
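The two steps described above, labelling read pairs as anomalous and scoring a candidate region with a Poisson model, can be sketched as follows. This is a minimal illustration and not BreakDancer's actual code: the `ReadPair` fields, the label names (including `everted`), the span thresholds `lower` and `upper`, and the choice of expected count (total ARFs scaled by region size over genome size) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One aligned read pair; all field names are hypothetical."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, lower, upper):
    """Return 'normal' or an anomaly label for one read pair.

    lower/upper bound the reference span expected for the library
    (assumed here to be supplied by the caller).
    """
    if pair.chrom1 != pair.chrom2:
        return "inter_chromosomal"
    if pair.strand1 == pair.strand2:
        return "inversion"
    # Strand of whichever read lies leftmost on the reference.
    left_strand = pair.strand1 if pair.pos1 <= pair.pos2 else pair.strand2
    if left_strand == "-":
        return "everted"  # reads point away from each other
    span = abs(pair.pos2 - pair.pos1)
    if span > upper:
        return "deletion"   # ends map farther apart than expected
    if span < lower:
        return "insertion"  # ends map closer together than expected
    return "normal"


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def confidence_score(n_support, region_bp, genome_bp, total_arfs):
    """Phred-style score: -10 * log10(chance probability), capped at 99."""
    # Assumed null model: ARFs fall uniformly along the genome.
    lam = total_arfs * region_bp / genome_bp
    p = poisson_tail(n_support, lam)
    if p <= 0.0:
        return 99.0
    return min(99.0, -10.0 * math.log10(p))
```

Under this assumed null model, a region supported by more ARFs than a uniform scattering would predict receives a smaller tail probability and hence a higher score.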
To exploit the high redundancy in WGNGS data, we developed the second algorithm, BreakDancerMini, which predicts small indels (10 bp to 100 bp) by examining the normally mapped fragments (NRFs) that are ignored by BreakDancerMax. NRFs are not informative individually; in groups, however, they often confer sufficient statistical power to distinguish sequence-altered regions from unaltered ones. BreakDancerMini identifies such regions using the Kolmogorov-Smirnov test and derives SVs using procedures similar to those of BreakDancerMax. Both the Max and Mini algorithms can be applied to a pool of DNA samples to identify common and novel variants.
We further confirm the putative SVs by assembling all read pairs that have at least one end mapped to the predicted intervals. We find that our assembler Tigra (Chen et al., unpublished) can achieve a confirmation rate as high as 93%. This in silico confirmation process returns the exact locations of the SVs and the nucleotide sequences that span the SV breakpoints.
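To illustrate how groups of normally mapped fragments can reveal a small indel, the sketch below compares the mapped spans of fragments in one genomic window against the library-wide spans using a two-sample Kolmogorov-Smirnov test. It is an assumption-laden example, not BreakDancerMini itself: the function names, the windowing, the significance threshold `alpha`, and the asymptotic p-value with a commonly used small-sample correction are all choices made here for illustration.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value from the Kolmogorov series."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d  # small-sample correction
    if lam < 0.2:
        return 1.0  # series converges slowly here; p is effectively 1
    total = 0.0
    for j in range(1, 101):
        term = 2.0 * (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, total))


def window_is_altered(window_spans, library_spans, alpha=1e-3):
    """Flag a window whose span distribution differs from the library's."""
    d = ks_statistic(window_spans, library_spans)
    return ks_pvalue(d, len(window_spans), len(library_spans)) < alpha
```

For example, if every fragment in a window maps 30 bp farther apart than the library norm, no single pair looks anomalous, yet the shifted distribution as a whole is flagged by `window_is_altered`.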
We examined BreakDancer's performance using paired-end data from the first WGNGS African genome and obtained a non-redundant set of 27,092 deletions, 19,305 insertions, and 665 inversions. BreakDancer identified a similar number of large deletions but 15 to 20 times more small indels than two other tools [6, 7], as judged against previously validated results [8].
We performed SV detection on the genomes of an individual with cytogenetically normal acute myeloid leukemia (AML) [9]. We jointly analyzed 42× short-read data from both the tumor and the normal samples and obtained 7,087 putative variants. The size distribution of the 3,170 deletions contained two spikes at 300 bp and 6 kb, produced respectively by the AluY and L1Hs elements (fig. 2). After removing germline variants that were detected in both the tumor and the normal samples, we derived a set of 223 putative somatic variants comprising 100 deletions, 67 insertions (< 100 bp), 22 inversions, and 34 intra-chromosomal translocations. Our assembly approach confirmed 100 (60%) of the 167 indels. We further submitted the entire set of 167 indels for PCR resequencing. 110 (69 deletions and 41 insertions) were validated both in the tumor and in the normal, indicating a 78% validation rate.
We applied BreakDancer to the 1,000 Genomes Project [10] and discovered and confirmed thousands of common SVs ranging from 50 bp to 1 Mbp at base-pair resolution.
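The germline-removal step in this joint tumor/normal analysis can be pictured with a short sketch. It assumes each putative variant carries per-sample counts of supporting read pairs; the `SVCall` fields and the `min_tumor_support` parameter are hypothetical names chosen for this example, and the actual filtering criteria are not specified in this article.

```python
from dataclasses import dataclass


@dataclass
class SVCall:
    """A putative variant from a pooled analysis; fields are hypothetical."""
    sv_type: str
    chrom: str
    start: int
    end: int
    tumor_support: int   # supporting read pairs from the tumor sample
    normal_support: int  # supporting read pairs from the normal sample


def split_somatic_germline(calls, min_tumor_support=2):
    """Separate calls seen only in the tumor from those also seen in the normal."""
    somatic, germline = [], []
    for call in calls:
        if call.normal_support > 0:
            germline.append(call)   # evidence in the normal: treat as germline
        elif call.tumor_support >= min_tumor_support:
            somatic.append(call)    # tumor-only evidence: putative somatic
    return somatic, germline
```

Calls with too little tumor support and no normal support fall into neither list in this sketch, which mirrors the idea that a prediction needs at least two supporting fragments.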
Our study indicates that BreakDancer achieves high-throughput and accurate SV discovery in human genomes and has greatly advanced the mutational profiling of cancer genomes. Some types of SVs, such as inversions and translocations, appeared to be more difficult to detect and validate. Many putative predictions overlapped with regions of complex repeats and required further analysis and filtering. Nonetheless, BreakDancer has been able to identify bona fide instances of inversions and translocations in our study of glioblastoma multiforme, ovarian and breast cancers, and AMLs.
The algorithms we implemented in BreakDancer are generic and can potentially be extended to analyze data of different insert sizes or data produced by different sequencing platforms. They can also be extended to analyze paired-end data obtained from mRNA sequencing to identify instances of gene fusion and alternative splicing.
[1] Feuk L. et al.: Nat Rev Genet 7, 85 (2006)
[2] Mitelman F. et al.: Nat Rev Cancer 7, 233 (2007)
[3] Bentley D. R. et al.: Nature 456, 53 (2008)
[4] Chen K. et al.: Nat Methods 6, 677 (2009)
[5] Li H. et al.: Genome Res 18, 1851 (2008)
[6] Hormozdiari F. et al.: Genome Res 19, 1270 (2009)
[7] Lee S. et al.: Nat Methods 6, 473 (2009)
[8] Kidd J. M. et al.: Nature 453, 56 (2008)
[9] Mardis E. R. et al.: N Engl J Med 361, 1058 (2009)
[10] Kaiser J.: Science 319, 395 (2008)