Article: Compressed indexing and local alignment of DNA

File Download Links for fulltext
(May Require Subscription)
Supplementary
  • Basic View
  • Metadata View
  • XML View
TitleCompressed indexing and local alignment of DNA
AuthorsLam, TW1
Sung, WK2
Tam, SL1
Wong, CK1
Yiu, SM1
Issue Date2008
PublisherOxford University Press. The Journal's web site is located at http://bioinformatics.oxfordjournals.org/
CitationBioinformatics, 2008, v. 24 n. 6, p. 791-797 [How to Cite?]
DOI: http://dx.doi.org/10.1093/bioinformatics/btn032
AbstractMotivation: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings such as the human genome in the main memory. For example, a BWT index for the human genome (with about 3 billion characters) occupies just around 1 G bytes. However, these indexes are designed for exact pattern matching, which is too stringent for biological applications. The demand is often on finding local alignments (pairs of similar substrings with gaps allowed). Without indexing, one can use dynamic programming to find all the local alignments between a text T and a pattern P in O ( T P ) time, but this would be too slow when the text is of genome scale (e.g. aligning a gene with the human genome would take tens to hundreds of hours). In practice, biologists use heuristic-based software such as BLAST, which is very efficient but does not guarantee to find all local alignments. Results: In this article, we show how to build a software called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments. Experiments reveal that BWT-SW is very efficient (e.g. aligning a pattern of length 3 000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically for a simpler similarity model (with gaps disallowed), and we show that the expected running time is O ( T 0.628 P ) for random strings. As far as we know, BWT-SW is the first practical tool that can find all local alignments. Yet BWT-SW is not meant to be a replacement of BLAST, as BLAST is still several times faster than BWT-SW for long patterns and BLAST is indeed accurate enough in most cases (we have used BWT-SW to check against the accuracy of BLAST and found that only rarely BLAST would miss some significant alignments). © The Author 2008. Published by Oxford University Press. All rights reserved.
ISSN1367-4803
2011 Impact Factor: 5.468
2011 SCImago Journal Rankings: 1.118
DOIhttp://dx.doi.org/10.1093/bioinformatics/btn032
ISI Accession Number IDWOS:000254010400008
ReferencesReferences in Scopus
DC Field
Value
dc.contributor.authorLam, TW
dc.contributor.authorSung, WK
dc.contributor.authorTam, SL
dc.contributor.authorWong, CK
dc.contributor.authorYiu, SM
dc.date.accessioned2010-09-06T09:52:53Z
dc.date.available2010-09-06T09:52:53Z
dc.date.issued2008
dc.description.abstractMotivation: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings such as the human genome in the main memory. For example, a BWT index for the human genome (with about 3 billion characters) occupies just around 1 G bytes. However, these indexes are designed for exact pattern matching, which is too stringent for biological applications. The demand is often on finding local alignments (pairs of similar substrings with gaps allowed). Without indexing, one can use dynamic programming to find all the local alignments between a text T and a pattern P in O ( T P ) time, but this would be too slow when the text is of genome scale (e.g. aligning a gene with the human genome would take tens to hundreds of hours). In practice, biologists use heuristic-based software such as BLAST, which is very efficient but does not guarantee to find all local alignments. Results: In this article, we show how to build a software called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments. Experiments reveal that BWT-SW is very efficient (e.g. aligning a pattern of length 3 000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically for a simpler similarity model (with gaps disallowed), and we show that the expected running time is O ( T 0.628 P ) for random strings. As far as we know, BWT-SW is the first practical tool that can find all local alignments. Yet BWT-SW is not meant to be a replacement of BLAST, as BLAST is still several times faster than BWT-SW for long patterns and BLAST is indeed accurate enough in most cases (we have used BWT-SW to check against the accuracy of BLAST and found that only rarely BLAST would miss some significant alignments). © The Author 2008. Published by Oxford University Press. All rights reserved.
dc.description.natureLink_to_subscribed_fulltext
dc.identifier.citationBioinformatics, 2008, v. 24 n. 6, p. 791-797 [How to Cite?]
DOI: http://dx.doi.org/10.1093/bioinformatics/btn032
dc.identifier.citeulike2835102
dc.identifier.doihttp://dx.doi.org/10.1093/bioinformatics/btn032
dc.identifier.epage797
dc.identifier.hkuros146733
dc.identifier.isiWOS:000254010400008
dc.identifier.issn1367-4803
2011 Impact Factor: 5.468
2011 SCImago Journal Rankings: 1.118
dc.identifier.issue6
dc.identifier.openurl
dc.identifier.pmid18227115
dc.identifier.scopuseid_2-s2.0-40749108125
dc.identifier.spage791
dc.identifier.urihttp://hdl.handle.net/10722/89140
dc.identifier.volume24
dc.languageeng
dc.publisherOxford University Press. The Journal's web site is located at http://bioinformatics.oxfordjournals.org/
dc.publisher.placeUnited Kingdom
dc.relation.ispartofBioinformatics
dc.relation.referencesReferences in Scopus
dc.rightsBioinformatics. Copyright © Oxford University Press.
dc.subject.meshAlgorithms
dc.subject.meshBase Sequence
dc.subject.meshChromosome Mapping - methods
dc.subject.meshDNA - genetics
dc.subject.meshData Compression - methods
dc.subject.meshGenome, Human - genetics
dc.subject.meshHumans
dc.subject.meshMolecular Sequence Data
dc.subject.meshSequence Alignment - methods
dc.subject.meshSequence Analysis, DNA - methods
dc.titleCompressed indexing and local alignment of DNA
dc.typeArticle
Author Affiliations
  1. The University of Hong Kong
  2. National University of Singapore