File Download
 
Links for fulltext
(May Require Subscription)
 
Supplementary

Article: Constructing compressed suffix arrays with large alphabets
  • Basic View
  • Metadata View
  • XML View
TitleConstructing compressed suffix arrays with large alphabets
 
AuthorsHon, WK2
Lam, TW2
Sadakane, K1
Sung, WK3
 
Issue Date2003
 
PublisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
 
CitationLecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 [How to Cite?]
 
AbstractRecent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.
 
ISSN0302-9743
2012 SCImago Journal Rankings: 0.332
 
ReferencesReferences in Scopus
 
DC FieldValue
dc.contributor.authorHon, WK
 
dc.contributor.authorLam, TW
 
dc.contributor.authorSadakane, K
 
dc.contributor.authorSung, WK
 
dc.date.accessioned2010-09-25T15:02:19Z
 
dc.date.available2010-09-25T15:02:19Z
 
dc.date.issued2003
 
dc.description.abstractRecent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.
 
dc.description.natureLink_to_subscribed_fulltext
 
dc.identifier.citationLecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 [How to Cite?]
 
dc.identifier.epage249
 
dc.identifier.hkuros91576
 
dc.identifier.issn0302-9743
2012 SCImago Journal Rankings: 0.332
 
dc.identifier.scopuseid_2-s2.0-35248823623
 
dc.identifier.spage240
 
dc.identifier.urihttp://hdl.handle.net/10722/93476
 
dc.identifier.volume2906
 
dc.languageeng
 
dc.publisherSpringer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
 
dc.publisher.placeGermany
 
dc.relation.ispartofLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
 
dc.relation.referencesReferences in Scopus
 
dc.titleConstructing compressed suffix arrays with large alphabets
 
dc.typeArticle
 
<?xml encoding="utf-8" version="1.0"?>
<item><contributor.author>Hon, WK</contributor.author>
<contributor.author>Lam, TW</contributor.author>
<contributor.author>Sadakane, K</contributor.author>
<contributor.author>Sung, WK</contributor.author>
<date.accessioned>2010-09-25T15:02:19Z</date.accessioned>
<date.available>2010-09-25T15:02:19Z</date.available>
<date.issued>2003</date.issued>
<identifier.citation>Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249</identifier.citation>
<identifier.issn>0302-9743</identifier.issn>
<identifier.uri>http://hdl.handle.net/10722/93476</identifier.uri>
<description.abstract>Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet &#211;, this algorithm requires O(|&#8721;|n log n) time and (2Ho + 1 + &#949;)n bits of working space, where Ho is the 0-th order empirical entropy of T and &#949; is any non-zero constant. This algorithm is good enough when the alphabet size |&#8721;| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + &#949;)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho &gt; 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. &#169; Springer-Verlag Berlin Heidelberg 2003.</description.abstract>
<language>eng</language>
<publisher>Springer Verlag. The Journal&apos;s web site is located at http://springerlink.com/content/105633/</publisher>
<relation.ispartof>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</relation.ispartof>
<title>Constructing compressed suffix arrays with large alphabets</title>
<type>Article</type>
<description.nature>Link_to_subscribed_fulltext</description.nature>
<identifier.scopus>eid_2-s2.0-35248823623</identifier.scopus>
<identifier.hkuros>91576</identifier.hkuros>
<relation.references>http://www.scopus.com/mlt/select.url?eid=2-s2.0-35248823623&amp;selection=ref&amp;src=s&amp;origin=recordpage</relation.references>
<identifier.volume>2906</identifier.volume>
<identifier.spage>240</identifier.spage>
<identifier.epage>249</identifier.epage>
<publisher.place>Germany</publisher.place>
</item>
Author Affiliations
  1. Kyushu University
  2. The University of Hong Kong
  3. National University of Singapore