‘Data deluge’ is a term most scientists are familiar with, especially those involved in handling and analysing next-generation sequencing (NGS) data. A single run from an Illumina HiSeq 2000 sequencer can produce 6 billion sequence reads, so extracting biological information from it can be a challenge. The disk space needed to store the data, the computational resources required for analysis, and the manpower needed to run the infrastructure, handle the data and interpret the results all make it imperative that the model of organization works well and benefits all involved.
Introduction
In the early 2000s an automated DNA sequencer could sequence about 1 Mb every two days. Currently, up to 110 gigabases of sequence can be produced in the same time frame, a roughly 100 000-fold increase, at rapidly decreasing cost. Despite the overall increase in yield, NGS technology produces much shorter sequence reads, often only a tenth of the length of those produced by the Sanger method (figure 1). This disparity means that new analytical and data management techniques are needed to get the most from the new sequencing technologies.
The high yield of sequence comes at a significant cost, namely the accuracy of the base calls, so it is necessary to be very selective about which data are kept and which are discarded. Next-generation sequences are typically presented in FastQ format (Cock et al. 2010), which contains both the DNA sequence and an estimate of how accurate each base call is. These ‘quality scores’ typically drop towards the end of a sequence, and when the quality falls too low it is usual to truncate the read by trimming the low-quality nucleotides off its end. FastQC (Andrews 2010) and the FastX Toolkit (Hannon 2009) are programs that can handle most of these initial QC tasks. Another task in the QC of data is counting k-mers (subsequences of length k taken from the reads). During sequencing each part of the DNA sequence of interest is typically sampled many times. For example, generating 50 Mb of sequence reads from a 5 Mb bacterial genome means we would expect each position to be sampled about 10 times, such that real k-mers appear about 10 times.
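The quality-score trimming described above can be illustrated with a minimal sketch. It assumes the common Phred+33 encoding of FastQ quality strings (an assumption; older Illumina files used a different offset), and the function names and the threshold of 20 are hypothetical choices for illustration. It is not the algorithm used by FastQC or the FastX Toolkit.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FastQ quality string into integer Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def trim_low_quality_tail(sequence, quality_string, threshold=20, offset=33):
    """Drop bases from the 3' end until one reaches the threshold.

    Only the tail is examined; low-quality bases inside the read are kept.
    """
    scores = phred_scores(quality_string, offset)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < threshold:
        keep -= 1
    return sequence[:keep], quality_string[:keep]


# Toy read: eight high-quality bases followed by two poor ones.
read_seq = "ACGTACGTAC"
read_qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_low_quality_tail(read_seq, read_qual)
print(trimmed_seq, trimmed_qual)
```

A Phred score of 20 corresponds to an estimated error probability of 1 in 100, which is why it is often used as a starting threshold; the right value depends on the dataset and the downstream analysis.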
Counting the number of times a given k-mer occurs and removing sequences that contain rare k-mers is a good way to remove errors introduced during sequencing. The Jellyfish tool (Marcais and Kingsford 2011) can be used to count k-mers and the Quake tool (Kelley et al. 2010) can use these counts to remove sequences containing rare k-mers.
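The idea of counting k-mers and discarding reads that contain rare ones can be sketched in a few lines. This is a toy, in-memory illustration with hypothetical function names and an arbitrary cut-off; it is not how Jellyfish or Quake are implemented, and real datasets need far more memory-efficient counting.

```python
from collections import Counter


def kmers(sequence, k):
    """Return every overlapping subsequence of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def drop_reads_with_rare_kmers(reads, k, min_count):
    """Keep only reads in which every k-mer occurs at least min_count times.

    Reads shorter than k contain no k-mers and are kept unchanged.
    """
    counts = count_kmers(reads, k)
    return [
        read for read in reads
        if all(counts[kmer] >= min_count for kmer in kmers(read, k))
    ]


def expected_coverage(total_read_bases, genome_size):
    """Average number of times each genome position is sampled."""
    return total_read_bases / genome_size


# Three identical reads plus one carrying a single-base error (A -> T).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
print(drop_reads_with_rare_kmers(reads, k=4, min_count=2))
print(expected_coverage(50_000_000, 5_000_000))
```

The single error creates k-mers that are seen only once, so the read carrying it is dropped, while the k-mers from the true sequence are seen repeatedly. The final line reproduces the 50 Mb over 5 Mb coverage calculation from the text.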
New Tools for a New Era
The uptake of NGS technologies by the scientific community has resulted in a new generation of sequence analysis tools, most of which are run by typing commands and parameters at a computer's command line.
Using the reads from a next-generation dataset to determine how the genome of one individual differs from another (sometimes called resequencing) is a typical application for NGS. In resequencing, NGS reads are compared to a reference genome to find the possible places to which they align, producing a ‘map’ of aligned reads. When the consensus of the aligned reads differs from the reference, we have a candidate polymorphism. On many occasions, the genome sequence of a closely related strain or species is used as the reference, and we are able to identify differences between these as well.
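As a rough illustration of how a consensus of aligned reads yields candidate polymorphisms, the sketch below piles up reads on a reference and reports positions where the most common base differs from the reference base. It assumes reads have already been aligned without gaps and are given as (0-based start, sequence) pairs; the function names and the minimum depth are hypothetical, and real variant callers also weigh base qualities, mapping qualities and indels.

```python
from collections import Counter


def pileup(reference, alignments):
    """Count the bases observed at each reference position.

    alignments is a list of (start, read) pairs, with 0-based starts
    and ungapped reads.
    """
    columns = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1
    return columns


def candidate_polymorphisms(reference, alignments, min_depth=2):
    """Report positions where the consensus base differs from the reference.

    Each result is (position, reference base, consensus base,
    reads supporting the consensus, total depth).
    """
    candidates = []
    for pos, column in enumerate(pileup(reference, alignments)):
        depth = sum(column.values())
        if depth < min_depth:
            continue  # too few reads to trust a consensus here
        consensus, support = column.most_common(1)[0]
        if consensus != reference[pos]:
            candidates.append((pos, reference[pos], consensus, support, depth))
    return candidates


reference = "ACGTACGTAC"
alignments = [(0, "ACGTTC"), (2, "GTTCGT"), (3, "TTCGTA")]
print(candidate_polymorphisms(reference, alignments))
```

In this toy example all three reads carry a T where the reference has an A at position 4, so that position is reported as a candidate.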
A number of tools are available for aligning reads to genomes; each has optimal utility depending on the type and size of project involved. Well-known tools such as BLAST (Altschul et al. 1990) and BLAT (Kent 2002) worked well with previous sequencing technologies, but new tools have been developed that can align millions of reads to a reference sequence much more quickly. The list of next-generation sequencing aligners is quite large, but Bfast (Homer et al. 2009), Mosaik (Strömberg 2009), Novoalign (Novocraft 2010), Bowtie (Langmead et al. 2009) and BWA (Li and Durbin 2009) are amongst the most popular. They differ mostly in the way that the sequence is indexed and in the model used to interpret the errors in the sequence reads.
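To give a feel for what indexing a sequence and modelling read errors can mean, here is a toy sketch that indexes every k-mer of a reference in a hash table, uses the first k bases of a read as a seed, and accepts placements with a limited number of mismatches. All names and parameters are hypothetical, and this is not the method of any aligner named above; those tools use considerably more sophisticated indexes and error models.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, max_mismatches=1):
    """Find ungapped placements of a read using its first k bases as a seed.

    Returns (0-based position, mismatches) pairs. Errors inside the seed
    itself are not tolerated in this simplified scheme.
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = hamming(read, window)
            if mismatches <= max_mismatches:
                hits.append((pos, mismatches))
    return hits


reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
print(map_read("CGTACG", reference, index, k=4))  # exact match
print(map_read("CGTTAC", reference, index, k=4))  # one mismatch at the end
```

Building the index once and reusing it for every read is what makes this kind of approach faster than comparing each read against the whole reference, at the price of the memory needed to hold the index.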
A common output format is used by these tools, which makes downstream analysis easy. Sequence Alignment Map (SAM) format (Li et al. 2009) is a de facto standard for storing next-generation sequence alignments. It can also be compressed into a binary version (known as BAM), which can be indexed to allow faster random access to regions of the alignment. Many next-generation visualisation tools, such as the browsers Savant (Fiume et al. 2010) and IGV (Robinson et al. 2011), use BAM files to display the mapping and allow the user to interpret results.
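For readers who have not seen the format, each alignment in a SAM file is one tab-separated line with eleven mandatory fields, optionally followed by extra tags. The sketch below parses those mandatory fields from a made-up example line; the class and function names are hypothetical, header lines (which start with `@`) are not handled, and in practice an established SAM/BAM library would normally be used instead.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of a SAM alignment line."""
    qname: str   # read name
    flag: int    # bitwise flags describing the alignment
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. "10M"
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line):
    """Parse one alignment line; optional tags after field 11 are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


def is_unmapped(record):
    """Flag bit 0x4 marks a read that could not be aligned."""
    return bool(record.flag & 0x4)


def is_reverse_strand(record):
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


# Invented example line, for illustration only.
example = "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_line(example)
print(record.rname, record.pos, is_reverse_strand(record), is_unmapped(record))
```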
Keywords : ABI ABI 3730xl ABI Solid 5500xl Automation Bioinformatics Daniel MacLean Graham J Etherington Illumina Illumina HiSeq 2000 Next-generation sequencing NGS Roche Roche 454 GS FLX Sequence Data The Sainsbury Laboratory
The Sainsbury Laboratory
Web: http://www.tsl.ac.uk/