Article: Constructing compressed suffix arrays with large alphabets

File Download Links for fulltext
(May Require Subscription)
Supplementary
  • Basic View
  • Metadata View
  • XML View
TitleConstructing compressed suffix arrays with large alphabets
AuthorsHon, WK2
Lam, TW2
Sadakane, K1
Sung, WK3
Issue Date2003
PublisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
CitationLecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 [How to Cite?]
AbstractRecent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.
ISSN0302-9743
2011 SCImago Journal Rankings: 0.034
ReferencesReferences in Scopus
DC Field
Value
dc.contributor.authorHon, WK
dc.contributor.authorLam, TW
dc.contributor.authorSadakane, K
dc.contributor.authorSung, WK
dc.date.accessioned2010-09-25T15:02:19Z
dc.date.available2010-09-25T15:02:19Z
dc.date.issued2003
dc.description.abstractRecent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.
dc.description.natureLink_to_subscribed_fulltext
dc.identifier.citationLecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 [How to Cite?]
dc.identifier.epage249
dc.identifier.hkuros91576
dc.identifier.issn0302-9743
2011 SCImago Journal Rankings: 0.034
dc.identifier.scopuseid_2-s2.0-35248823623
dc.identifier.spage240
dc.identifier.urihttp://hdl.handle.net/10722/93476
dc.identifier.volume2906
dc.languageeng
dc.publisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
dc.publisher.placeGermany
dc.relation.ispartofLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
dc.relation.referencesReferences in Scopus
dc.titleConstructing compressed suffix arrays with large alphabets
dc.typeArticle
Author Affiliations
  1. Kyushu University
  2. The University of Hong Kong
  3. National University of Singapore