Links for fulltext (may require subscription):
- Publisher Website: 10.1007/978-3-540-69304-8_1
- Scopus: eid_2-s2.0-45849103858
Conference Paper: Chinese Word Segmentation for Terrorism-related Contents
Title | Chinese Word Segmentation for Terrorism-related Contents |
---|---|
Authors | Zeng, D; Wei, D; Chau, MCL; Wang, F |
Keywords | Heuristic rules; Lidstone flatness; Mutual information; N-gram; Suffix tree; Ukkonen algorithm |
Issue Date | 2008 |
Publisher | Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/ |
Citation | The 2008 IEEE International Conference on Intelligence and Security Informatics (ISI) Workshops (PAISI / PACCF / SOCO 2008), Taipei, Taiwan, 17 June 2008. In Lecture Notes in Computer Science, 2008, v. 5075, p. 1-13 |
Abstract | In order to analyze security- and terrorism-related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistical and dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot deal with new words and are restricted to the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. Through the use of a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter using a terrorism-related corpus. The experimental results show that IASeg performs better than the two benchmarks in both precision and recall. © 2008 Springer-Verlag Berlin Heidelberg. |
Persistent Identifier | http://hdl.handle.net/10722/112338 |
ISSN | 0302-9743 (2023 SCImago Journal Rankings: 0.606) |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zeng, D | en_HK |
dc.contributor.author | Wei, D | en_HK |
dc.contributor.author | Chau, MCL | en_HK |
dc.contributor.author | Wang, F | en_HK |
dc.date.accessioned | 2010-09-26T03:27:47Z | - |
dc.date.available | 2010-09-26T03:27:47Z | - |
dc.date.issued | 2008 | en_HK |
dc.identifier.citation | The 2008 IEEE International Conference on Intelligence and Security Informatics (ISI) Workshops (PAISI / PACCF / SOCO 2008), Taipei, Taiwan, 17 June 2008. In Lecture Notes in Computer Science, 2008, v. 5075, p. 1-13 | - |
dc.identifier.issn | 0302-9743 | - |
dc.identifier.uri | http://hdl.handle.net/10722/112338 | - |
dc.description.abstract | In order to analyze security- and terrorism-related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistical and dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot deal with new words and are restricted to the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. Through the use of a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter using a terrorism-related corpus. The experimental results show that IASeg performs better than the two benchmarks in both precision and recall. © 2008 Springer-Verlag Berlin Heidelberg. | - |
dc.language | eng | en_HK |
dc.publisher | Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/ | en_HK |
dc.relation.ispartof | Lecture Notes In Computer Science | en_HK |
dc.subject | Heuristic rules | - |
dc.subject | Lidstone flatness | - |
dc.subject | Mutual information | - |
dc.subject | N-gram | - |
dc.subject | Suffix tree | - |
dc.subject | Ukkonen algorithm | - |
dc.title | Chinese Word Segmentation for Terrorism-related Contents | en_HK |
dc.type | Conference_Paper | en_HK |
dc.identifier.email | Chau, MCL: mchau@hkucc.hku.hk | en_HK |
dc.identifier.authority | Chau, MCL=rp01051 | en_HK |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1007/978-3-540-69304-8_1 | - |
dc.identifier.scopus | eid_2-s2.0-45849103858 | - |
dc.identifier.hkuros | 148599 | en_HK |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-45849103858&selection=ref&src=s&origin=recordpage | - |
dc.identifier.volume | 5075 | - |
dc.identifier.spage | 1 | - |
dc.identifier.epage | 13 | - |
dc.publisher.place | Germany | - |
dc.identifier.scopusauthorid | Zeng, D=34668758000 | - |
dc.identifier.scopusauthorid | Wei, D=24472144100 | - |
dc.identifier.scopusauthorid | Chau, M=7006073763 | - |
dc.identifier.scopusauthorid | Wang, F=7501308070 | - |
dc.customcontrol.immutable | sml 160120 - merged | - |
dc.identifier.issnl | 0302-9743 | - |
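
The abstract mentions identifying new words through MI-based token merging. As a rough illustration only (this is not the IASeg implementation, whose details are not given in this record), the sketch below computes pointwise mutual information for adjacent tokens in a small pre-tokenized corpus and greedily merges pairs that score at or above a threshold. The function names `pmi_table` and `merge_tokens`, the greedy left-to-right strategy, and the threshold are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Return pointwise mutual information (log base 2) per adjacent token pair.

    `sentences` is a list of token lists, e.g. single characters.
    Hypothetical helper for illustration; not taken from the paper.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def merge_tokens(tokens, table, threshold):
    """Greedily merge adjacent pairs, left to right, when PMI >= threshold.

    Pairs never seen in training are treated as having PMI of -infinity,
    so they are never merged. The threshold is an assumed tuning knob.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair is not None and table.get(pair, float("-inf")) >= threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Raw PMI tends to overrate pairs built from rare tokens, which is one reason a real system might combine it with smoothing or a dictionary check, as the keywords and abstract of this record suggest.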