**Article:**Constructing compressed suffix arrays with large alphabets

Title | Constructing compressed suffix arrays with large alphabets |
---|---|

Authors | Hon, WK2 Lam, TW2 Sadakane, K1 Sung, WK3 |

Issue Date | 2003 |

Publisher | Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/ |

Citation | Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 [How to Cite?] |

Abstract | Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003. |

ISSN | 0302-9743 2012 SCImago Journal Rankings: 0.332 |

References | References in Scopus |

DC Field | Value |
---|---|

dc.contributor.author | Hon, WK |

dc.contributor.author | Lam, TW |

dc.contributor.author | Sadakane, K |

dc.contributor.author | Sung, WK |

dc.date.accessioned | 2010-09-25T15:02:19Z |

dc.date.available | 2010-09-25T15:02:19Z |

dc.date.issued | 2003 |

dc.description.abstract | Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003. |

dc.description.nature | Link_to_subscribed_fulltext |

dc.identifier.citation | Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 [How to Cite?] |

dc.identifier.epage | 249 |

dc.identifier.hkuros | 91576 |

dc.identifier.issn | 0302-9743 2012 SCImago Journal Rankings: 0.332 |

dc.identifier.scopus | eid_2-s2.0-35248823623 |

dc.identifier.spage | 240 |

dc.identifier.uri | http://hdl.handle.net/10722/93476 |

dc.identifier.volume | 2906 |

dc.language | eng |

dc.publisher | Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/ |

dc.publisher.place | Germany |

dc.relation.ispartof | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |

dc.relation.references | References in Scopus |

dc.title | Constructing compressed suffix arrays with large alphabets |

dc.type | Article |

<?xml encoding="utf-8" version="1.0"?> <item><contributor.author>Hon, WK</contributor.author> <contributor.author>Lam, TW</contributor.author> <contributor.author>Sadakane, K</contributor.author> <contributor.author>Sung, WK</contributor.author> <date.accessioned>2010-09-25T15:02:19Z</date.accessioned> <date.available>2010-09-25T15:02:19Z</date.available> <date.issued>2003</date.issued> <identifier.citation>Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249</identifier.citation> <identifier.issn>0302-9743</identifier.issn> <identifier.uri>http://hdl.handle.net/10722/93476</identifier.uri> <description.abstract>Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003.</description.abstract> <language>eng</language> <publisher>Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/</publisher> <relation.ispartof>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</relation.ispartof> <title>Constructing compressed suffix arrays with large alphabets</title> <type>Article</type> <description.nature>Link_to_subscribed_fulltext</description.nature> <identifier.scopus>eid_2-s2.0-35248823623</identifier.scopus> <identifier.hkuros>91576</identifier.hkuros> <relation.references>http://www.scopus.com/mlt/select.url?eid=2-s2.0-35248823623&selection=ref&src=s&origin=recordpage</relation.references> <identifier.volume>2906</identifier.volume> <identifier.spage>240</identifier.spage> <identifier.epage>249</identifier.epage> <publisher.place>Germany</publisher.place> </item>

Author Affiliations

- Kyushu University
- The University of Hong Kong
- National University of Singapore