File Download

There are no files associated with this item.

  Links for fulltext
     (May Require Subscription)
Supplementary

Conference Paper: Indexing similar DNA sequences

TitleIndexing similar DNA sequences
Authors
KeywordsBasic operation
Genetic variation
Genomic sequence
Memory requirements
R-sequences
Issue Date2010
PublisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
Citation
The 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190 How to Cite?
Abstract
To study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. To build an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + N logN) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m+mlogN +mlog(rk)psc(P)+occ log n), where psc(P) is the number of prefixes of P that appear as a suffix of some common segments and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution1 and evaluated our solution using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that required by BWT built on the concatenation of all sequences. When compared to the other existing solution (RLCSA), we use less memory with faster searching time. © Springer-Verlag Berlin Heidelberg 2010.
DescriptionLNCS v. 6124 is Proceedings of the 6th International Conference, AAIM 2010
Persistent Identifierhttp://hdl.handle.net/10722/139980
ISSN
2013 SCImago Journal Rankings: 0.310
References

 

Author Affiliations
  1. The University of Hong Kong
  2. National University of Singapore
DC FieldValueLanguage
dc.contributor.authorHuang, Sen_HK
dc.contributor.authorLam, TWen_HK
dc.contributor.authorSung, WKen_HK
dc.contributor.authorTam, SLen_HK
dc.contributor.authorYiu, SMen_HK
dc.date.accessioned2011-09-23T06:04:20Z-
dc.date.available2011-09-23T06:04:20Z-
dc.date.issued2010en_HK
dc.identifier.citationThe 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190en_HK
dc.identifier.issn0302-9743en_HK
dc.identifier.urihttp://hdl.handle.net/10722/139980-
dc.descriptionLNCS v. 6124 is Proceedings of the 6th International Conference, AAIM 2010-
dc.description.abstractTo study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. To build an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + N logN) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m+mlogN +mlog(rk)psc(P)+occ log n), where psc(P) is the number of prefixes of P that appear as a suffix of some common segments and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution1 and evaluated our solution using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that required by BWT built on the concatenation of all sequences. When compared to the other existing solution (RLCSA), we use less memory with faster searching time. © Springer-Verlag Berlin Heidelberg 2010.en_HK
dc.languageengen_US
dc.publisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/en_HK
dc.relation.ispartofLecture Notes in Computer Scienceen_HK
dc.rightsThe original publication is available at www.springerlink.comen_US
dc.subjectBasic operation-
dc.subjectGenetic variation-
dc.subjectGenomic sequence-
dc.subjectMemory requirements-
dc.subjectR-sequences-
dc.titleIndexing similar DNA sequencesen_HK
dc.typeConference_Paperen_HK
dc.identifier.emailLam, TW:twlam@cs.hku.hken_HK
dc.identifier.emailYiu, SM:smyiu@cs.hku.hken_HK
dc.identifier.authorityLam, TW=rp00135en_HK
dc.identifier.authorityYiu, SM=rp00207en_HK
dc.description.naturelink_to_subscribed_fulltext-
dc.identifier.doi10.1007/978-3-642-14355-7_19en_HK
dc.identifier.scopuseid_2-s2.0-79956310837en_HK
dc.identifier.hkuros192198en_US
dc.relation.referenceshttp://www.scopus.com/mlt/select.url?eid=2-s2.0-79956310837&selection=ref&src=s&origin=recordpageen_HK
dc.identifier.volume6124en_HK
dc.identifier.spage180en_HK
dc.identifier.epage190en_HK
dc.publisher.placeGermanyen_HK
dc.description.otherThe 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190-
dc.identifier.scopusauthoridHuang, S=37861409000en_HK
dc.identifier.scopusauthoridLam, TW=7202523165en_HK
dc.identifier.scopusauthoridSung, WK=13310059700en_HK
dc.identifier.scopusauthoridTam, SL=14042926200en_HK
dc.identifier.scopusauthoridYiu, SM=7003282240en_HK

Export via OAI-PMH Interface in XML Formats


OR


Export to Other Non-XML Formats