Constructing compressed suffix arrays with large alphabets

Hon, WK; Lam, TW; Sadakane, K; Sung, WK

File Download

There are no files associated with this item.

Links for fulltext

(May Require Subscription)

Scopus: eid_2-s2.0-35248823623
Find via

Supplementary

Citations:
- Scopus: 0
Appears in Collections:
- Computer Science: Journal/Magazine Articles

Article: Constructing compressed suffix arrays with large alphabets

Title	Constructing compressed suffix arrays with large alphabets
Authors	Hon, WK Lam, TW Sadakane, K Sung, WK
Issue Date	2003
Publisher	Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
Citation	Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 How to Cite?
Abstract	Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(\|∑\|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size \|∑\| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.
Persistent Identifier	http://hdl.handle.net/10722/93476
ISSN	0302-9743 2023 SCImago Journal Rankings: 0.606
References	References in Scopus

DC Field	Value	Language
dc.contributor.author	Hon, WK	en_HK
dc.contributor.author	Lam, TW	en_HK
dc.contributor.author	Sadakane, K	en_HK
dc.contributor.author	Sung, WK	en_HK
dc.date.accessioned	2010-09-25T15:02:19Z	-
dc.date.available	2010-09-25T15:02:19Z	-
dc.date.issued	2003	en_HK
dc.identifier.citation	Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249	en_HK
dc.identifier.issn	0302-9743	en_HK
dc.identifier.uri	http://hdl.handle.net/10722/93476	-
dc.description.abstract	Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(\|∑\|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size \|∑\| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.	en_HK
dc.language	eng	en_HK
dc.publisher	Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/	en_HK
dc.relation.ispartof	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)	en_HK
dc.title	Constructing compressed suffix arrays with large alphabets	en_HK
dc.type	Article	en_HK
dc.identifier.email	Lam, TW:twlam@cs.hku.hk	en_HK
dc.identifier.authority	Lam, TW=rp00135	en_HK
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.scopus	eid_2-s2.0-35248823623	en_HK
dc.identifier.hkuros	91576	en_HK
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-35248823623&selection=ref&src=s&origin=recordpage	en_HK
dc.identifier.volume	2906	en_HK
dc.identifier.spage	240	en_HK
dc.identifier.epage	249	en_HK
dc.publisher.place	Germany	en_HK
dc.identifier.scopusauthorid	Hon, WK=7004282818	en_HK
dc.identifier.scopusauthorid	Lam, TW=7202523165	en_HK
dc.identifier.scopusauthorid	Sadakane, K=7005716583	en_HK
dc.identifier.scopusauthorid	Sung, WK=13310059700	en_HK
dc.identifier.issnl	0302-9743	-

File Download

Links for fulltext

(May Require Subscription)

Supplementary

Article: Constructing compressed suffix arrays with large alphabets

Export via OAI-PMH Interface in XML Formats

OR

Export to Other Non-XML Formats