Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1007/s10796-010-9278-5
- Scopus: eid_2-s2.0-79952897748
- WOS: WOS:000288220000010
Article: Domain-specific Chinese word segmentation using suffix tree and mutual information
Title | Domain-specific Chinese word segmentation using suffix tree and mutual information |
---|---|
Authors | Zeng, D; Wei, D; Chau, M; Wang, F |
Keywords | Chinese segmentation; Heuristic rules; Mutual information; N-gram; Suffix tree; Ukkonen algorithm |
Issue Date | 2011 |
Publisher | Springer New York LLC. The Journal's web site is located at http://springerlink.metapress.com/openurl.asp?genre=journal&issn=1387-3326 |
Citation | Information Systems Frontiers, 2011, v. 13 n. 1, p. 115-125 |
Abstract | As the amount of online Chinese contents grows, there is a critical need for effective Chinese word segmentation approaches to facilitate Web computing applications in a range of domains including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. The pure statistical method has lower precision, while the pure dictionary-based method cannot deal with new words beyond the dictionary. In this paper, we propose a hybrid method that is able to avoid the limitations of both types of approaches. Through the use of suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experiment results show that IASeg performs better than the benchmarks in both precision and recall for the domain-specific corpus and achieves comparable performance for the general corpus. © 2010 Springer Science+Business Media, LLC. |
Persistent Identifier | http://hdl.handle.net/10722/139819 |
ISSN | 1387-3326 (2023 Impact Factor: 6.9; 2023 SCImago Journal Rankings: 1.577) |
ISI Accession Number ID | WOS:000288220000010 |
Funding Information | The reported work was supported in part by the following grants: NNSFC #90924302 and #60921061, MOST #2006AA010106, CAS #2F07C01, NSF #IIS-0428241, and HKU #10207565. We thank our team member Mr. Qingyang Xu for his help with the experiments. We also thank Ms. Fenglin Li and Ms. Shufang Tang for their help with data preparation and processing. |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zeng, D | en_HK |
dc.contributor.author | Wei, D | en_HK |
dc.contributor.author | Chau, M | en_HK |
dc.contributor.author | Wang, F | en_HK |
dc.date.accessioned | 2011-09-23T05:56:54Z | - |
dc.date.available | 2011-09-23T05:56:54Z | - |
dc.date.issued | 2011 | en_HK |
dc.identifier.citation | Information Systems Frontiers, 2011, v. 13 n. 1, p. 115-125 | en_HK |
dc.identifier.issn | 1387-3326 | en_HK |
dc.identifier.uri | http://hdl.handle.net/10722/139819 | - |
dc.language | eng | en_US |
dc.publisher | Springer New York LLC. The Journal's web site is located at http://springerlink.metapress.com/openurl.asp?genre=journal&issn=1387-3326 | en_HK |
dc.relation.ispartof | Information Systems Frontiers | en_HK |
dc.rights | The original publication is available at www.springerlink.com | - |
dc.subject | Chinese segmentation | en_HK |
dc.subject | Heuristic rules | en_HK |
dc.subject | Mutual information | en_HK |
dc.subject | N-gram | en_HK |
dc.subject | Suffix tree | en_HK |
dc.subject | Ukkonen algorithm | en_HK |
dc.title | Domain-specific Chinese word segmentation using suffix tree and mutual information | en_HK |
dc.type | Article | en_HK |
dc.identifier.openurl | http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=1387-3326&volume=13&issue=1&spage=115&epage=125&date=2011&atitle=Domain-specific+Chinese+word+segmentation+using+suffix+tree+and+mutual+information | - |
dc.identifier.email | Chau, M: mchau@hkucc.hku.hk | en_HK |
dc.identifier.authority | Chau, M=rp01051 | en_HK |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1007/s10796-010-9278-5 | en_HK |
dc.identifier.scopus | eid_2-s2.0-79952897748 | en_HK |
dc.identifier.hkuros | 193032 | en_US |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-79952897748&selection=ref&src=s&origin=recordpage | en_HK |
dc.identifier.volume | 13 | en_HK |
dc.identifier.issue | 1 | en_HK |
dc.identifier.spage | 115 | en_HK |
dc.identifier.epage | 125 | en_HK |
dc.identifier.isi | WOS:000288220000010 | - |
dc.publisher.place | United States | en_HK |
dc.identifier.scopusauthorid | Zeng, D=7102694556 | en_HK |
dc.identifier.scopusauthorid | Wei, D=24472144100 | en_HK |
dc.identifier.scopusauthorid | Chau, M=7006073763 | en_HK |
dc.identifier.scopusauthorid | Wang, F=36549957500 | en_HK |
dc.identifier.citeulike | 8127385 | - |
dc.identifier.issnl | 1387-3326 | - |