Conference Paper: Chinese Word Segmentation for Terrorism-related Contents

Title: Chinese Word Segmentation for Terrorism-related Contents
Authors: Zeng, D; Wei, D; Chau, MCL; Wang, F
Keywords: Heuristic rules; Lidstone flatness; Mutual information; N-gram; Suffix tree; Ukkonen algorithm
Issue Date: 2008
Publisher: Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/
Citation: The 2008 IEEE International Conference on Intelligence and Security Informatics (ISI) Workshops (PAISI / PACCF / SOCO 2008), Taipei, Taiwan, 17 June 2008. In Lecture Notes in Computer Science, 2008, v. 5075, p. 1-13
Abstract: In order to analyze security- and terrorism-related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistics-based and dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot deal with new words and are restricted to the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. Through the use of a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter using a terrorism-related corpus. The experimental results show that IASeg performs better than the two benchmarks in both precision and recall. © 2008 Springer-Verlag Berlin Heidelberg.
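The abstract mentions MI-based token merging but gives no formula. As a rough illustration only, the sketch below assumes pointwise mutual information over adjacent tokens with Lidstone (add-alpha) smoothing and a greedy left-to-right merge. The function names, the `alpha` value, and the `threshold` are hypothetical choices for this example, not details taken from the paper, and this is not the IASeg implementation.

```python
import math
from collections import Counter


def build_pmi(sentences, alpha=0.5):
    """Return a function pmi(a, b) estimated from tokenized sentences.

    Probabilities use Lidstone (add-alpha) smoothing; alpha is an
    illustrative assumption, not a value from the paper.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    vocab = len(unigrams)

    def p_uni(t):
        return (unigrams[t] + alpha) / (n_uni + alpha * vocab)

    def p_bi(a, b):
        # vocab * vocab possible adjacent pairs share the smoothing mass
        return (bigrams[(a, b)] + alpha) / (n_bi + alpha * vocab * vocab)

    def pmi(a, b):
        return math.log2(p_bi(a, b) / (p_uni(a) * p_uni(b)))

    return pmi


def merge_by_pmi(tokens, pmi, threshold):
    """Greedily attach a token to the previous one when the PMI of the
    original adjacent pair exceeds the (hypothetical) threshold."""
    if not tokens:
        return []
    merged = [tokens[0]]
    prev = tokens[0]
    for tok in tokens[1:]:
        if pmi(prev, tok) > threshold:
            merged[-1] += tok
        else:
            merged.append(tok)
        prev = tok
    return merged


# Toy corpus: each sentence is a list of single-character tokens.
corpus = [list("abcd"), list("abdc"), list("cabd"), list("dcab")]
pmi = build_pmi(corpus)
print(merge_by_pmi(list("cabd"), pmi, threshold=1.5))
```

In this toy corpus the pair `a`, `b` always occurs together, so it scores a higher PMI than the other pairs and is the only one merged at this threshold. A real segmenter would start from dictionary-based tokens and tune the threshold on domain data.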
Persistent Identifier: http://hdl.handle.net/10722/112338
ISSN: 0302-9743
2023 SCImago Journal Rankings: 0.606
DC Field | Value | Language
dc.contributor.author | Zeng, D | en_HK
dc.contributor.author | Wei, D | en_HK
dc.contributor.author | Chau, MCL | en_HK
dc.contributor.author | Wang, F | en_HK
dc.date.accessioned | 2010-09-26T03:27:47Z | -
dc.date.available | 2010-09-26T03:27:47Z | -
dc.date.issued | 2008 | en_HK
dc.identifier.citation | The 2008 IEEE International Conference on Intelligence and Security Informatics (ISI) Workshops (PAISI / PACCF / SOCO 2008), Taipei, Taiwan, 17 June 2008. In Lecture Notes in Computer Science, 2008, v. 5075, p. 1-13 | -
dc.identifier.issn | 0302-9743 | -
dc.identifier.uri | http://hdl.handle.net/10722/112338 | -
dc.description.abstract | (same text as the Abstract above) | -
dc.language | eng | en_HK
dc.publisher | Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/ | en_HK
dc.relation.ispartof | Lecture Notes In Computer Science | en_HK
dc.subject | Heuristic rules | -
dc.subject | Lidstone flatness | -
dc.subject | Mutual information | -
dc.subject | N-gram | -
dc.subject | Suffix tree | -
dc.subject | Ukkonen algorithm | -
dc.title | Chinese Word Segmentation for Terrorism-related Contents | en_HK
dc.type | Conference_Paper | en_HK
dc.identifier.email | Chau, MCL: mchau@hkucc.hku.hk | en_HK
dc.identifier.authority | Chau, MCL=rp01051 | en_HK
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1007/978-3-540-69304-8_1 | -
dc.identifier.scopus | eid_2-s2.0-45849103858 | -
dc.identifier.hkuros | 148599 | en_HK
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-45849103858&selection=ref&src=s&origin=recordpage | -
dc.identifier.volume | 5075 | -
dc.identifier.spage | 1 | -
dc.identifier.epage | 13 | -
dc.publisher.place | Germany | -
dc.identifier.scopusauthorid | Zeng, D=34668758000 | -
dc.identifier.scopusauthorid | Wei, D=24472144100 | -
dc.identifier.scopusauthorid | Chau, M=7006073763 | -
dc.identifier.scopusauthorid | Wang, F=7501308070 | -
dc.customcontrol.immutable | sml 160120 - merged | -
dc.identifier.issnl | 0302-9743 | -
