File Download

There are no files associated with this item.

  Links for fulltext
     (May Require Subscription)
Supplementary

Article: Constructing compressed suffix arrays with large alphabets

TitleConstructing compressed suffix arrays with large alphabets
Authors
Issue Date2003
PublisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
Citation
Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 How to Cite?
Abstract
Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.
Persistent Identifierhttp://hdl.handle.net/10722/93476
ISSN
2013 SCImago Journal Rankings: 0.310
References

 

Author Affiliations
  1. Kyushu University
  2. The University of Hong Kong
  3. National University of Singapore
DC FieldValueLanguage
dc.contributor.authorHon, WKen_HK
dc.contributor.authorLam, TWen_HK
dc.contributor.authorSadakane, Ken_HK
dc.contributor.authorSung, WKen_HK
dc.date.accessioned2010-09-25T15:02:19Z-
dc.date.available2010-09-25T15:02:19Z-
dc.date.issued2003en_HK
dc.identifier.citationLecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249en_HK
dc.identifier.issn0302-9743en_HK
dc.identifier.urihttp://hdl.handle.net/10722/93476-
dc.description.abstractRecent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.en_HK
dc.languageengen_HK
dc.publisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/en_HK
dc.relation.ispartofLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)en_HK
dc.titleConstructing compressed suffix arrays with large alphabetsen_HK
dc.typeArticleen_HK
dc.identifier.emailLam, TW:twlam@cs.hku.hken_HK
dc.identifier.authorityLam, TW=rp00135en_HK
dc.description.naturelink_to_subscribed_fulltext-
dc.identifier.scopuseid_2-s2.0-35248823623en_HK
dc.identifier.hkuros91576en_HK
dc.relation.referenceshttp://www.scopus.com/mlt/select.url?eid=2-s2.0-35248823623&selection=ref&src=s&origin=recordpageen_HK
dc.identifier.volume2906en_HK
dc.identifier.spage240en_HK
dc.identifier.epage249en_HK
dc.publisher.placeGermanyen_HK
dc.identifier.scopusauthoridHon, WK=7004282818en_HK
dc.identifier.scopusauthoridLam, TW=7202523165en_HK
dc.identifier.scopusauthoridSadakane, K=7005716583en_HK
dc.identifier.scopusauthoridSung, WK=13310059700en_HK

Export via OAI-PMH Interface in XML Formats


OR


Export to Other Non-XML Formats