File Download
 
Links for fulltext
(May Require Subscription)
 
Supplementary

Article: Compressed indexing and local alignment of DNA
  • Basic View
  • Metadata View
  • XML View
TitleCompressed indexing and local alignment of DNA
 
AuthorsLam, TW1
Sung, WK2
Tam, SL1
Wong, CK1
Yiu, SM1
 
Issue Date2008
 
PublisherOxford University Press. The Journal's web site is located at http://bioinformatics.oxfordjournals.org/
 
CitationBioinformatics, 2008, v. 24 n. 6, p. 791-797 [How to Cite?]
DOI: http://dx.doi.org/10.1093/bioinformatics/btn032
 
AbstractMotivation: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings such as the human genome in the main memory. For example, a BWT index for the human genome (with about 3 billion characters) occupies just around 1 G bytes. However, these indexes are designed for exact pattern matching, which is too stringent for biological applications. The demand is often on finding local alignments (pairs of similar substrings with gaps allowed). Without indexing, one can use dynamic programming to find all the local alignments between a text T and a pattern P in O ( T P ) time, but this would be too slow when the text is of genome scale (e.g. aligning a gene with the human genome would take tens to hundreds of hours). In practice, biologists use heuristic-based software such as BLAST, which is very efficient but does not guarantee to find all local alignments. Results: In this article, we show how to build a software called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments. Experiments reveal that BWT-SW is very efficient (e.g. aligning a pattern of length 3 000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically for a simpler similarity model (with gaps disallowed), and we show that the expected running time is O ( T 0.628 P ) for random strings. As far as we know, BWT-SW is the first practical tool that can find all local alignments. Yet BWT-SW is not meant to be a replacement of BLAST, as BLAST is still several times faster than BWT-SW for long patterns and BLAST is indeed accurate enough in most cases (we have used BWT-SW to check against the accuracy of BLAST and found that only rarely BLAST would miss some significant alignments). © The Author 2008. Published by Oxford University Press. All rights reserved.
 
ISSN1367-4803
2012 Impact Factor: 5.323
2012 SCImago Journal Rankings: 4.223
 
DOIhttp://dx.doi.org/10.1093/bioinformatics/btn032
 
ISI Accession Number IDWOS:000254010400008
 
ReferencesReferences in Scopus
 
DC FieldValue
dc.contributor.authorLam, TW
 
dc.contributor.authorSung, WK
 
dc.contributor.authorTam, SL
 
dc.contributor.authorWong, CK
 
dc.contributor.authorYiu, SM
 
dc.date.accessioned2010-09-06T09:52:53Z
 
dc.date.available2010-09-06T09:52:53Z
 
dc.date.issued2008
 
dc.description.abstractMotivation: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings such as the human genome in the main memory. For example, a BWT index for the human genome (with about 3 billion characters) occupies just around 1 G bytes. However, these indexes are designed for exact pattern matching, which is too stringent for biological applications. The demand is often on finding local alignments (pairs of similar substrings with gaps allowed). Without indexing, one can use dynamic programming to find all the local alignments between a text T and a pattern P in O ( T P ) time, but this would be too slow when the text is of genome scale (e.g. aligning a gene with the human genome would take tens to hundreds of hours). In practice, biologists use heuristic-based software such as BLAST, which is very efficient but does not guarantee to find all local alignments. Results: In this article, we show how to build a software called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments. Experiments reveal that BWT-SW is very efficient (e.g. aligning a pattern of length 3 000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically for a simpler similarity model (with gaps disallowed), and we show that the expected running time is O ( T 0.628 P ) for random strings. As far as we know, BWT-SW is the first practical tool that can find all local alignments. Yet BWT-SW is not meant to be a replacement of BLAST, as BLAST is still several times faster than BWT-SW for long patterns and BLAST is indeed accurate enough in most cases (we have used BWT-SW to check against the accuracy of BLAST and found that only rarely BLAST would miss some significant alignments). © The Author 2008. Published by Oxford University Press. All rights reserved.
 
dc.description.natureLink_to_subscribed_fulltext
 
dc.identifier.citationBioinformatics, 2008, v. 24 n. 6, p. 791-797 [How to Cite?]
DOI: http://dx.doi.org/10.1093/bioinformatics/btn032
 
dc.identifier.citeulike2835102
 
dc.identifier.doihttp://dx.doi.org/10.1093/bioinformatics/btn032
 
dc.identifier.eissn1460-2059
 
dc.identifier.epage797
 
dc.identifier.hkuros146733
 
dc.identifier.isiWOS:000254010400008
 
dc.identifier.issn1367-4803
2012 Impact Factor: 5.323
2012 SCImago Journal Rankings: 4.223
 
dc.identifier.issue6
 
dc.identifier.openurl
 
dc.identifier.pmid18227115
 
dc.identifier.scopuseid_2-s2.0-40749108125
 
dc.identifier.spage791
 
dc.identifier.urihttp://hdl.handle.net/10722/89140
 
dc.identifier.volume24
 
dc.languageeng
 
dc.publisherOxford University Press. The Journal's web site is located at http://bioinformatics.oxfordjournals.org/
 
dc.publisher.placeUnited Kingdom
 
dc.relation.ispartofBioinformatics
 
dc.relation.referencesReferences in Scopus
 
dc.rightsBioinformatics. Copyright © Oxford University Press.
 
dc.subject.meshAlgorithms
 
dc.subject.meshBase Sequence
 
dc.subject.meshChromosome Mapping - methods
 
dc.subject.meshDNA - genetics
 
dc.subject.meshData Compression - methods
 
dc.subject.meshGenome, Human - genetics
 
dc.subject.meshHumans
 
dc.subject.meshMolecular Sequence Data
 
dc.subject.meshSequence Alignment - methods
 
dc.subject.meshSequence Analysis, DNA - methods
 
dc.titleCompressed indexing and local alignment of DNA
 
dc.typeArticle
 
<?xml encoding="utf-8" version="1.0"?>
<item><contributor.author>Lam, TW</contributor.author>
<contributor.author>Sung, WK</contributor.author>
<contributor.author>Tam, SL</contributor.author>
<contributor.author>Wong, CK</contributor.author>
<contributor.author>Yiu, SM</contributor.author>
<date.accessioned>2010-09-06T09:52:53Z</date.accessioned>
<date.available>2010-09-06T09:52:53Z</date.available>
<date.issued>2008</date.issued>
<identifier.citation>Bioinformatics, 2008, v. 24 n. 6, p. 791-797</identifier.citation>
<identifier.issn>1367-4803</identifier.issn>
<identifier.uri>http://hdl.handle.net/10722/89140</identifier.uri>
<description.abstract>Motivation: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings such as the human genome in the main memory. For example, a BWT index for the human genome (with about 3 billion characters) occupies just around 1 G bytes. However, these indexes are designed for exact pattern matching, which is too stringent for biological applications. The demand is often on finding local alignments (pairs of similar substrings with gaps allowed). Without indexing, one can use dynamic programming to find all the local alignments between a text T and a pattern P in O ( T P ) time, but this would be too slow when the text is of genome scale (e.g. aligning a gene with the human genome would take tens to hundreds of hours). In practice, biologists use heuristic-based software such as BLAST, which is very efficient but does not guarantee to find all local alignments. Results: In this article, we show how to build a software called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments. Experiments reveal that BWT-SW is very efficient (e.g. aligning a pattern of length 3 000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically for a simpler similarity model (with gaps disallowed), and we show that the expected running time is O ( T 0.628 P ) for random strings. As far as we know, BWT-SW is the first practical tool that can find all local alignments. Yet BWT-SW is not meant to be a replacement of BLAST, as BLAST is still several times faster than BWT-SW for long patterns and BLAST is indeed accurate enough in most cases (we have used BWT-SW to check against the accuracy of BLAST and found that only rarely BLAST would miss some significant alignments). &#169; The Author 2008. Published by Oxford University Press. All rights reserved.</description.abstract>
<language>eng</language>
<publisher>Oxford University Press. The Journal&apos;s web site is located at http://bioinformatics.oxfordjournals.org/</publisher>
<relation.ispartof>Bioinformatics</relation.ispartof>
<rights>Bioinformatics. Copyright &#169; Oxford University Press.</rights>
<subject.mesh>Algorithms</subject.mesh>
<subject.mesh>Base Sequence</subject.mesh>
<subject.mesh>Chromosome Mapping - methods</subject.mesh>
<subject.mesh>DNA - genetics</subject.mesh>
<subject.mesh>Data Compression - methods</subject.mesh>
<subject.mesh>Genome, Human - genetics</subject.mesh>
<subject.mesh>Humans</subject.mesh>
<subject.mesh>Molecular Sequence Data</subject.mesh>
<subject.mesh>Sequence Alignment - methods</subject.mesh>
<subject.mesh>Sequence Analysis, DNA - methods</subject.mesh>
<title>Compressed indexing and local alignment of DNA</title>
<type>Article</type>
<identifier.openurl>http://library.hku.hk:4550/resserv?sid=HKU:IR&amp;issn=1367-4803&amp;volume=24 &amp;issue=6&amp;spage=791&amp;epage=797&amp;date=2008&amp;atitle=Compressed+indexing+and+local+alignment+of+DNA</identifier.openurl>
<description.nature>Link_to_subscribed_fulltext</description.nature>
<identifier.doi>10.1093/bioinformatics/btn032</identifier.doi>
<identifier.pmid>18227115</identifier.pmid>
<identifier.scopus>eid_2-s2.0-40749108125</identifier.scopus>
<identifier.hkuros>146733</identifier.hkuros>
<relation.references>http://www.scopus.com/mlt/select.url?eid=2-s2.0-40749108125&amp;selection=ref&amp;src=s&amp;origin=recordpage</relation.references>
<identifier.volume>24</identifier.volume>
<identifier.issue>6</identifier.issue>
<identifier.spage>791</identifier.spage>
<identifier.epage>797</identifier.epage>
<identifier.eissn>1460-2059</identifier.eissn>
<identifier.isi>WOS:000254010400008</identifier.isi>
<publisher.place>United Kingdom</publisher.place>
<identifier.citeulike>2835102</identifier.citeulike>
</item>
Author Affiliations
  1. The University of Hong Kong
  2. National University of Singapore