File Download
 
Links for fulltext
(May Require Subscription)
 
Supplementary

Conference Paper: Indexing similar DNA sequences
  • Basic View
  • Metadata View
  • XML View
TitleIndexing similar DNA sequences
 
AuthorsHuang, S1
Lam, TW1
Sung, WK2
Tam, SL1
Yiu, SM1
 
KeywordsBasic operation
Genetic variation
Genomic sequence
Memory requirements
R-sequences
 
Issue Date2010
 
PublisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
 
CitationThe 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190 [How to Cite?]
DOI: http://dx.doi.org/10.1007/978-3-642-14355-7_19
 
AbstractTo study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. To build an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + N logN) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m+mlogN +mlog(rk)psc(P)+occ log n), where psc(P) is the number of prefixes of P that appear as a suffix of some common segments and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution1 and evaluated our solution using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that required by BWT built on the concatenation of all sequences. When compared to the other existing solution (RLCSA), we use less memory with faster searching time. © Springer-Verlag Berlin Heidelberg 2010.
 
DescriptionLNCS v. 6124 is Proceedings of the 6th International Conference, AAIM 2010
 
ISSN0302-9743
2013 SCImago Journal Rankings: 0.310
 
DOIhttp://dx.doi.org/10.1007/978-3-642-14355-7_19
 
ReferencesReferences in Scopus
 
DC FieldValue
dc.contributor.authorHuang, S
 
dc.contributor.authorLam, TW
 
dc.contributor.authorSung, WK
 
dc.contributor.authorTam, SL
 
dc.contributor.authorYiu, SM
 
dc.date.accessioned2011-09-23T06:04:20Z
 
dc.date.available2011-09-23T06:04:20Z
 
dc.date.issued2010
 
dc.description.abstractTo study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. To build an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + N logN) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m+mlogN +mlog(rk)psc(P)+occ log n), where psc(P) is the number of prefixes of P that appear as a suffix of some common segments and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution1 and evaluated our solution using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that required by BWT built on the concatenation of all sequences. When compared to the other existing solution (RLCSA), we use less memory with faster searching time. © Springer-Verlag Berlin Heidelberg 2010.
 
dc.description.natureLink_to_subscribed_fulltext
 
dc.descriptionLNCS v. 6124 is Proceedings of the 6th International Conference, AAIM 2010
 
dc.description.otherThe 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190
 
dc.identifier.citationThe 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190 [How to Cite?]
DOI: http://dx.doi.org/10.1007/978-3-642-14355-7_19
 
dc.identifier.doihttp://dx.doi.org/10.1007/978-3-642-14355-7_19
 
dc.identifier.epage190
 
dc.identifier.hkuros192198
 
dc.identifier.issn0302-9743
2013 SCImago Journal Rankings: 0.310
 
dc.identifier.scopuseid_2-s2.0-79956310837
 
dc.identifier.spage180
 
dc.identifier.urihttp://hdl.handle.net/10722/139980
 
dc.identifier.volume6124
 
dc.languageeng
 
dc.publisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
 
dc.publisher.placeGermany
 
dc.relation.ispartofLecture Notes in Computer Science
 
dc.relation.referencesReferences in Scopus
 
dc.rightsThe original publication is available at www.springerlink.com
 
dc.subjectBasic operation
 
dc.subjectGenetic variation
 
dc.subjectGenomic sequence
 
dc.subjectMemory requirements
 
dc.subjectR-sequences
 
dc.titleIndexing similar DNA sequences
 
dc.typeConference_Paper
 
<?xml encoding="utf-8" version="1.0"?>
<item><contributor.author>Huang, S</contributor.author>
<contributor.author>Lam, TW</contributor.author>
<contributor.author>Sung, WK</contributor.author>
<contributor.author>Tam, SL</contributor.author>
<contributor.author>Yiu, SM</contributor.author>
<date.accessioned>2011-09-23T06:04:20Z</date.accessioned>
<date.available>2011-09-23T06:04:20Z</date.available>
<date.issued>2010</date.issued>
<identifier.citation>The 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190</identifier.citation>
<identifier.issn>0302-9743</identifier.issn>
<identifier.uri>http://hdl.handle.net/10722/139980</identifier.uri>
<description>LNCS v. 6124 is Proceedings of the 6th International Conference, AAIM 2010</description>
<description.abstract>To study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. To build an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + N logN) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires &#937;(n + N) bits. Searching for a pattern P of length m takes O(m+mlogN +mlog(rk)psc(P)+occ log n), where psc(P) is the number of prefixes of P that appear as a suffix of some common segments and occ is the number of occurrences of P in all sequences. In practice, rk &#8804; N, and psc(P) is usually a small constant. We have implemented our solution1 and evaluated our solution using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that required by BWT built on the concatenation of all sequences. When compared to the other existing solution (RLCSA), we use less memory with faster searching time. &#169; Springer-Verlag Berlin Heidelberg 2010.</description.abstract>
<language>eng</language>
<publisher>Springer Verlag. The Journal&apos;s web site is located at http://springerlink.com/content/105633/</publisher>
<relation.ispartof>Lecture Notes in Computer Science</relation.ispartof>
<rights>The original publication is available at www.springerlink.com</rights>
<subject>Basic operation</subject>
<subject>Genetic variation</subject>
<subject>Genomic sequence</subject>
<subject>Memory requirements</subject>
<subject>R-sequences</subject>
<title>Indexing similar DNA sequences</title>
<type>Conference_Paper</type>
<description.nature>Link_to_subscribed_fulltext</description.nature>
<identifier.doi>10.1007/978-3-642-14355-7_19</identifier.doi>
<identifier.scopus>eid_2-s2.0-79956310837</identifier.scopus>
<identifier.hkuros>192198</identifier.hkuros>
<relation.references>http://www.scopus.com/mlt/select.url?eid=2-s2.0-79956310837&amp;selection=ref&amp;src=s&amp;origin=recordpage</relation.references>
<identifier.volume>6124</identifier.volume>
<identifier.spage>180</identifier.spage>
<identifier.epage>190</identifier.epage>
<publisher.place>Germany</publisher.place>
<description.other>The 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190</description.other>
</item>
Author Affiliations
  1. The University of Hong Kong
  2. National University of Singapore