Indexing similar DNA sequences

Huang, S; Lam, TW; Sung, WK; Tam, SL; Yiu, SM

File Download

There are no files associated with this item.

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1007/978-3-642-14355-7_19
Scopus: eid_2-s2.0-79956310837
Find via

Supplementary

Citations:
- Scopus: 0
Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Indexing similar DNA sequences

Title	Indexing similar DNA sequences
Authors	Huang, S Lam, TW Sung, WK Tam, SL Yiu, SM
Keywords	Basic operation Genetic variation Genomic sequence Memory requirements R-sequences
Issue Date	2010
Publisher	Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
Citation	The 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190 How to Cite? DOI: http://dx.doi.org/10.1007/978-3-642-14355-7_19
Abstract	To study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. To build an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + N logN) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m+mlogN +mlog(rk)psc(P)+occ log n), where psc(P) is the number of prefixes of P that appear as a suffix of some common segments and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution1 and evaluated our solution using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that required by BWT built on the concatenation of all sequences. When compared to the other existing solution (RLCSA), we use less memory with faster searching time. © Springer-Verlag Berlin Heidelberg 2010.
Description	LNCS v. 6124 is Proceedings of the 6th International Conference, AAIM 2010
Persistent Identifier	http://hdl.handle.net/10722/139980
ISSN	0302-9743 2023 SCImago Journal Rankings: 0.606
References	References in Scopus

DC Field	Value	Language
dc.contributor.author	Huang, S	en_HK
dc.contributor.author	Lam, TW	en_HK
dc.contributor.author	Sung, WK	en_HK
dc.contributor.author	Tam, SL	en_HK
dc.contributor.author	Yiu, SM	en_HK
dc.date.accessioned	2011-09-23T06:04:20Z	-
dc.date.available	2011-09-23T06:04:20Z	-
dc.date.issued	2010	en_HK
dc.identifier.citation	The 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190	en_HK
dc.identifier.issn	0302-9743	en_HK
dc.identifier.uri	http://hdl.handle.net/10722/139980	-
dc.description	LNCS v. 6124 is Proceedings of the 6th International Conference, AAIM 2010	-
dc.description.abstract	To study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. To build an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + N logN) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m+mlogN +mlog(rk)psc(P)+occ log n), where psc(P) is the number of prefixes of P that appear as a suffix of some common segments and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution1 and evaluated our solution using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that required by BWT built on the concatenation of all sequences. When compared to the other existing solution (RLCSA), we use less memory with faster searching time. © Springer-Verlag Berlin Heidelberg 2010.	en_HK
dc.language	eng	en_US
dc.publisher	Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/	en_HK
dc.relation.ispartof	Lecture Notes in Computer Science	en_HK
dc.rights	The original publication is available at www.springerlink.com	en_US
dc.subject	Basic operation	-
dc.subject	Genetic variation	-
dc.subject	Genomic sequence	-
dc.subject	Memory requirements	-
dc.subject	R-sequences	-
dc.title	Indexing similar DNA sequences	en_HK
dc.type	Conference_Paper	en_HK
dc.identifier.email	Lam, TW:twlam@cs.hku.hk	en_HK
dc.identifier.email	Yiu, SM:smyiu@cs.hku.hk	en_HK
dc.identifier.authority	Lam, TW=rp00135	en_HK
dc.identifier.authority	Yiu, SM=rp00207	en_HK
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1007/978-3-642-14355-7_19	en_HK
dc.identifier.scopus	eid_2-s2.0-79956310837	en_HK
dc.identifier.hkuros	192198	en_US
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-79956310837&selection=ref&src=s&origin=recordpage	en_HK
dc.identifier.volume	6124	en_HK
dc.identifier.spage	180	en_HK
dc.identifier.epage	190	en_HK
dc.publisher.place	Germany	en_HK
dc.description.other	The 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Shandong, China, 19-21 July 2010. In Lecture Notes in Computer Science, v. 6124, p. 180-190	-
dc.identifier.scopusauthorid	Huang, S=37861409000	en_HK
dc.identifier.scopusauthorid	Lam, TW=7202523165	en_HK
dc.identifier.scopusauthorid	Sung, WK=13310059700	en_HK
dc.identifier.scopusauthorid	Tam, SL=14042926200	en_HK
dc.identifier.scopusauthorid	Yiu, SM=7003282240	en_HK
dc.identifier.issnl	0302-9743	-

File Download

Links for fulltext

(May Require Subscription)

Supplementary

Conference Paper: Indexing similar DNA sequences

Export via OAI-PMH Interface in XML Formats

OR

Export to Other Non-XML Formats