Technical Advances Drive Sequencing to Supersonic Speeds
A big part of the science of bioinformatics today spills forth from the revolution in sequencing technology. From the days when laborious, hand-run sequencing gels yielded a few hundred bases in the space of several days, the technology has evolved to the point where modern sequencing machines pour out petabytes of data. This innovation in hardware allows the exploration of scientific questions that would have been inconceivable a decade ago. However, there is a significant downside to the sequencing gold rush: the generation of huge volumes of data at dramatically lowered costs threatens to overwhelm existing storage capacity.
The first protocols for DNA sequencing were introduced in the late 1970s. Early sequencing of genes used the Sanger method, originally a grinding manual process that was later streamlined by the polymerase chain reaction (PCR). Small fragments of DNA are copied by a polymerase through the addition of bases, one at a time, to a primer (a short fragment complementary to a single-stranded DNA template of interest). The key innovation was the use of dideoxynucleotide triphosphates (ddNTPs) to terminate the elongation reaction of the DNA primer. These molecules cause the elongation reaction to stop, yielding a collection of molecules of different lengths, each terminated at the position where the dideoxy form of one of the four bases was incorporated into the growing chain. When radioactively labelled nucleotides are used in the reaction, the position of each fragment can be determined by autoradiography, using large, unwieldy gels to separate the products. After X-ray film is applied to the gels, the DNA products of different lengths appear as a ladder running from top to bottom, from which the order of the DNA bases can be read.
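The chain-termination logic can be sketched in a few lines of code. This is a toy model rather than a laboratory protocol: the template, molecule count and termination rate are illustrative assumptions, and the ladder is "read" by looking up the template (in the real assay, the radioactive label and the four reaction lanes identify each terminal base).

```python
import random

random.seed(0)  # reproducible toy run

def sanger_fragments(template, n_molecules=10000, ddntp_fraction=0.05):
    """Toy chain-termination model: each growing strand has a small chance,
    at every position, of incorporating a dideoxy (terminating) base,
    freezing the fragment at that length."""
    lengths = set()
    for _ in range(n_molecules):
        for pos in range(1, len(template) + 1):
            if random.random() < ddntp_fraction:
                lengths.add(pos)  # fragment terminated at this length
                break
    return lengths

def read_ladder(template, lengths):
    """Reading the gel ladder from shortest to longest fragment recovers the
    base order: each fragment length contributes its terminal base.
    (Here we look the base up from the template; in reality the lane or
    dye identifies it.)"""
    return "".join(template[n - 1] for n in sorted(lengths))

template = "ATGCGTACGTTAGC"
called = read_ladder(template, sanger_fragments(template))
print(called)
```

With enough molecules in the reaction, every fragment length appears and the ladder reproduces the template sequence.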
An early automated version of sequencing technology was the capillary sequencing machine, in which the products are labelled with fluorescent dideoxy tags at the terminal position of the growing DNA fragments and separated in tiny capillary tubes. Using four dyes that fluoresce at different wavelengths, a laser excites the passing fragments and each band is identified by the wavelength at which it fluoresces.
The data are displayed as a chromatogram, with the peaks corresponding to the nucleotides at each position in the sequence.
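At its simplest, base calling from such a trace reduces to picking the strongest of the four dye channels at each position. The intensities below are made-up illustrative values; real base callers also model peak shape, spacing and cross-talk between dyes.

```python
# Each chromatogram position carries four peak intensities, one per dye.
# (Illustrative numbers only, not real instrument output.)
TRACE = [
    {"A": 880, "C": 40, "G": 35, "T": 60},   # strong A peak
    {"A": 50, "C": 30, "G": 910, "T": 45},   # strong G peak
    {"A": 45, "C": 870, "G": 55, "T": 40},   # strong C peak
    {"A": 60, "C": 50, "G": 40, "T": 900},   # strong T peak
]

def call_bases(trace):
    # Naive base calling: the brightest channel wins at each position.
    return "".join(max(position, key=position.get) for position in trace)

print(call_bases(TRACE))  # AGCT
```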
The Sanger method was used in the Human Genome Project, despite its drawbacks: the need to produce a library of genomic fragments stored in bacterial plasmids, as well as the requirement for costly reagents and labour-intensive sample preparation.
Pyrosequencing, a much more rapid approach, has largely supplanted these earlier technologies. It is a sequence of enzymatic reactions in which the incorporation of DNA bases is measured by the release of visible light. A DNA template is incubated with a mix of enzymes, including DNA polymerase and ATP sulfurylase. As the reaction runs to completion, DNA bases are incorporated complementary to the DNA template strand; pyrophosphate is released and converted to ATP by the sulfurylase enzyme, which in turn drives a luciferase reaction, releasing light. The light is detected and recorded, the excess material is washed away, and the entire process is repeated, base after base, until the whole fragment is sequenced.
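The cycle of dispensation, incorporation, light release and washing can be mimicked in a short simulation. This toy model reads the template directly, ignoring strand complementarity and enzyme kinetics; it does capture one real quirk of pyrosequencing, namely that a homopolymer run releases one proportionally brighter flash rather than several separate ones.

```python
def pyrosequence(template, max_cycles=40):
    """Toy pyrosequencing run: dispense bases in a fixed order; when the
    dispensed base matches the next template position(s), each incorporation
    releases pyrophosphate and hence light, so the signal is roughly
    proportional to the number of bases incorporated in that flow."""
    flow_order = "ACGT"
    pos, flowgram = 0, []
    for cycle in range(max_cycles):
        base = flow_order[cycle % 4]
        incorporated = 0
        while pos < len(template) and template[pos] == base:
            incorporated += 1  # one PPi released -> one unit of light
            pos += 1
        flowgram.append((base, incorporated))
        if pos == len(template):
            break  # whole fragment sequenced
    return flowgram

# The 'AA' homopolymer gives a single double-intensity flash.
for base, light in pyrosequence("AACGTT"):
    print(base, "*" * light)
```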
Pyrosequencing and related next-generation approaches have emerged as powerful technologies, and diverse hardware is on the market, making the instrumentation widely available. These devices include the GS FLX System from 454 Life Sciences, the Genome Analyzer from Illumina, and the SOLiD system from Applied Biosystems; others are in the works, and the warp speed of the industry assures that this analysis will be outdated by the time it appears.
Illuminating Dark Corners of the Genome
The Genome Analyzer is founded on large-scale parallel sequencing of millions of fragments using reversible terminator-based sequencing chemistry. The device, which has been adopted by both single-investigator laboratories and genome centres worldwide, relies on the attachment of randomly fragmented genomic DNA to a planar, optically transparent surface. These fragments are extended and bridge-amplified, in an automated process, to create an ultra-high-density sequencing flow cell with ≥100 million clusters, each containing ca. 1,000 copies of the same template. With run times ranging from two to seven days, these templates are sequenced using a four-colour DNA sequencing technology that employs reversible terminators with removable fluorescent dyes.
High-sensitivity fluorescence detection is achieved using laser excitation and total internal reflection optics. Sequence reads of up to 75 base pairs, from each end of the fragment, are aligned against a reference genome, and genetic differences are called using specially developed analytical software. The simplicity of the sequencing chemistry has enabled rapid technological advances and new sample preparation methodologies. In particular, customers can use the Genome Analyzer for a range of applications, including whole-genome, targeted, and de novo sequencing, analysis of bisulfite-converted DNA, transcriptome profiling, and protein-nucleic acid interactions.
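The align-and-call step can be illustrated with a deliberately brute-force sketch: slide each short read along the reference, keep the offset with the fewest mismatches, and report the remaining mismatches as candidate variants. The sequences are invented for illustration, and the sliding-window search is a stand-in for the specially developed alignment software mentioned above, which uses far faster indexing schemes.

```python
def best_alignment(read, reference):
    """Return (offset, mismatches) for the placement of the read on the
    reference with the fewest mismatching bases (naive O(n*m) search)."""
    best = (None, len(read) + 1)
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(r != g for r, g in zip(read, window))
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best

def call_differences(read, reference):
    """List (position, reference_base, read_base) at each mismatch of the
    best alignment -- a toy version of variant calling."""
    offset, _ = best_alignment(read, reference)
    return [(offset + i, reference[offset + i], read[i])
            for i in range(len(read)) if read[i] != reference[offset + i]]

reference = "TTGACCGGTAACGTACGTAT"
read = "CGGTTACG"  # matches the reference except one substitution
print(call_differences(read, reference))
```

Running this reports a single substitution at reference position 9, where the read carries a T over the reference A.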
The 454 Approach
The 454 sequencing technology allows bulk sequencing of an entire genome. DNA samples are fragmented and ligated to adaptor molecules, which act as primers in a PCR reaction in which each single DNA molecule is replicated on a bead until the same sequence has been multiplied to ca. 10 million copies. Each bead, now behaving as a clone, is deposited into an individual well of a micro-slide. A loaded slide is fed into the sequencing machine, the reagents are added, the reaction is run, and the release of light tallies the DNA sequence in each well of the slide. The data can then be fed into and crunched by a computer. While the technology is capable of reading only short sequences, it can read more than 4,000 sequences in a single run, meaning that the machine can sequence more than 4.5 million bases, or an entire bacterial genome, in two days.
The new technologies differ from the old Sanger sequencing method in that they increase throughput by laying out millions of DNA fragments on a single chip and sequencing them all in parallel. While these devices are much faster, the fragments they sequence are much shorter, which introduces important bioinformatics considerations. Traditional capillary sequencers read up to 900 bp at a stretch, whereas the newer machines read between 35 and 250 bp at a time. Connecting these short stretches requires overlapping fragments and software to analyse them. Also, the quality of sequence from the new technologies does not match that of the Sanger method, so greater demands are placed on the analysis software to detect and correct errors.
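The overlap idea can be made concrete with a toy greedy assembler: repeatedly find the pair of fragments with the longest suffix-prefix overlap and merge them. The fragments below are invented, and real assemblers must additionally cope with sequencing errors, repeats and reverse-complement reads, none of which this sketch handles.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b,
    requiring at least min_len bases of overlap."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge the best-overlapping pair until no sufficient overlap remains --
    a toy version of overlap-based assembly of short reads."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # remaining fragments cannot be joined
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

print(greedy_assemble(["TTACGGA", "CGGATCA", "ATCAGGC"]))
```

The three overlapping fragments reassemble into the single sequence "TTACGGATCAGGC".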
The most recent figure for sequencing a human genome is US-$ 60,000 in about six weeks, as reported by Applied Biosystems last month. (That is down from US-$ 3 billion for the Human Genome Project, which was sequenced using traditional methods and finished in 2003, and from about US-$ 1 million for James Watson's genome, sequenced using a newer, high-throughput approach and released last year.) But scientists are still racing to develop methods that are fast and cheap enough to allow everyone to have their genome sequenced, thus truly ushering in the era of personalised medicine. Researchers suggest that a depth of coverage of 25-30x and beyond is appropriate for resequencing human genomes, and Illumina reports that it has this year sequenced a human genome with 25x coverage.
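The coverage figures above translate into simple arithmetic: depth of coverage c equals total sequenced bases divided by genome size, and under the classic Lander-Waterman model a fraction e^-c of the genome stays uncovered by chance. The genome size and read length below are illustrative assumptions, not figures from this article.

```python
import math

# Back-of-envelope coverage arithmetic (illustrative figures).
genome_size = 3.2e9   # approximate human genome length, bases
read_length = 75      # a typical short-read length, bp
target_depth = 30     # the upper end of the 25-30x figure quoted above

# Depth c = (number of reads * read length) / genome size, rearranged:
reads_needed = target_depth * genome_size / read_length

# Lander-Waterman: the expected fraction of bases with zero coverage is e^-c.
uncovered = math.exp(-target_depth) * genome_size

print(f"reads needed for {target_depth}x coverage: {reads_needed:.2e}")
print(f"expected uncovered bases at {target_depth}x: {uncovered:.2e}")
```

Even at 30x, roughly 1.3 billion 75-bp reads are required per genome, which is why coverage depth, not instrument speed alone, drives the data volumes discussed below.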
Completing the Story
An innovative approach to drastically lowering the cost of sequencing is under development at Complete Genomics, which has adopted a nanotechnology-based system (fig. 1).
According to President and CEO Clifford Reid, the sequencing combines DNA nanoarrays with an approach called combinatorial probe-anchor ligation (cPAL). The DNA fragments are clonally amplified in solution into DNA nanoballs using a circle-dependent replication technology, and are then immobilised on a substrate with patterned submicron spacing. Because of this scale-down, reagent volumes and costs are reduced up to 10-fold. Once formed, the DNA nanoballs are sequenced using the company's hybridisation-ligation sequencing chemistry. cPAL uses a series of matching probes that bind to the genomic DNA, resolving the inability of previous methods to read simple repeats: it relies on probe-anchor ligation and removal of the anchor-probe product after each base is imaged, but without the drawbacks of the ligation chaining required by previous methods.
"By splicing in adaptors, the system reads up to 80 bases with high accuracy," Reid stated. "The extremely high efficiency of the system will allow entires genomes to be read for a few thousand dollars."
Into the Void
The new sequencing instrumentation boggles the imagination; yet this deluge of data production raises concerns. Consider that Complete Genomics intends to sequence 20,000 human genomes in 2010, 2,000 of them in collaboration with the Institute for Systems Biology. In addition, many new species of bacteria will be sequenced in this period, and there have been proposals to sequence the genomes of all available organisms in culture collections.
And this is just the tip of the sequencing iceberg: there will also be an orgy of sequencing from private biotech companies seeking to address specific for-profit questions.
Moreover, new technologies are waiting in the wings that may challenge these existing and maturing platforms. Nanoknife-edge sequencing uses tiny knife-edge probes to interrogate DNA, with the potential to sequence a genome error-free in minutes. Sequencing-by-synthesis (SBS) chemistry uses single-stranded DNA templates on an array and incorporates fluorescently labelled nucleotide analogues into complementary strands; such an instrument could potentially sequence ca. 5 gigabases per day and sell for ca. US-$ 300K. A third possibility is nanopore array-based systems, a technology used to detect the location of hybridisation probes.
The overwhelming amount of data arising from this technology could cause massive road blocks in the system. If there is no way to analyse, store and retrieve the data from these multiple sites all over the world, and no overarching models to guide its interpretation, then it is unlikely to advance clinical medicine, and it may become another elephant in the parlour, as some of the last decade of aimless genome and proteome sequencing has become.