Article: Domain-specific Chinese word segmentation using suffix tree and mutual information

Title: Domain-specific Chinese word segmentation using suffix tree and mutual information
Authors: Zeng, D; Wei, D; Chau, M; Wang, F
Keywords: Chinese segmentation; Heuristic rules; Mutual information; N-gram; Suffix tree; Ukkonen algorithm
Issue Date: 2011
Publisher: Springer New York LLC. The Journal's web site is located at http://springerlink.metapress.com/openurl.asp?genre=journal&issn=1387-3326
Citation: Information Systems Frontiers, 2011, v. 13 n. 1, p. 115-125
Abstract: As the amount of online Chinese contents grows, there is a critical need for effective Chinese word segmentation approaches to facilitate Web computing applications in a range of domains including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. The pure statistical method has lower precision, while the pure dictionary-based method cannot deal with new words beyond the dictionary. In this paper, we propose a hybrid method that is able to avoid the limitations of both types of approaches. Through the use of suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experiment results show that IASeg performs better than the benchmarks in both precision and recall for the domain-specific corpus and achieves comparable performance for the general corpus. © 2010 Springer Science+Business Media, LLC.
Persistent Identifier: http://hdl.handle.net/10722/139819
ISSN: 1387-3326
2015 Impact Factor: 1.45
2015 SCImago Journal Rankings: 0.756
ISI Accession Number ID: WOS:000288220000010
Funding Agency: Grant Number
NNSFC: 90924302, 60921061
MOST: 2006AA010106
CAS: 2F07C01
NSF: IIS-0428241
HKU: 10207565
Funding Information:

The reported work was supported in part by the following grants: NNSFC #90924302 and #60921061, MOST #2006AA010106, CAS #2F07C01, NSF #IIS-0428241, and HKU #10207565. We thank our team member Mr. Qingyang Xu for his help with the experiments. We also thank Ms. Fenglin Li and Ms. Shufang Tang for their help with data preparation and processing.
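
The abstract mentions identifying new words through MI-based token merging. The following is a minimal sketch of that general idea only, not the IASeg procedure itself, which this record does not describe. It assumes pointwise mutual information over adjacent tokens, a greedy left-to-right merge, and illustrative parameters (`threshold`, `min_count`); the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def pmi_table(sentences):
    """Return (pmi, counts) for every adjacent token pair in the corpus."""
    unigrams = Counter(tok for s in sentences for tok in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {
        (a, b): math.log2((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), c in bigrams.items()
    }
    return pmi, bigrams

def merge_tokens(tokens, pmi, counts, threshold=1.0, min_count=2):
    """Greedily merge adjacent tokens whose pair is frequent and has high PMI.

    threshold and min_count are assumed values chosen for illustration.
    """
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            pair = (tokens[i], tokens[i + 1])
            if counts.get(pair, 0) >= min_count and pmi.get(pair, float("-inf")) >= threshold:
                out.append(tokens[i] + tokens[i + 1])  # candidate new word
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out

# Toy corpus, pre-split into single characters (hypothetical data).
corpus = [list("恐怖分子"), list("恐怖袭击"), list("我怕恐怖")]
pmi, counts = pmi_table(corpus)
print(merge_tokens(list("恐怖袭击"), pmi, counts))  # ['恐怖', '袭', '击']
```

In a hybrid segmenter of the kind the abstract describes, merged candidates such as these would presumably be checked against, and added to, the dictionary; that step is omitted here.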


DC Field | Value | Language
dc.contributor.author | Zeng, D | en_HK
dc.contributor.author | Wei, D | en_HK
dc.contributor.author | Chau, M | en_HK
dc.contributor.author | Wang, F | en_HK
dc.date.accessioned | 2011-09-23T05:56:54Z | -
dc.date.available | 2011-09-23T05:56:54Z | -
dc.date.issued | 2011 | en_HK
dc.identifier.citation | Information Systems Frontiers, 2011, v. 13 n. 1, p. 115-125 | en_HK
dc.identifier.issn | 1387-3326 | en_HK
dc.identifier.uri | http://hdl.handle.net/10722/139819 | -
dc.description.abstract | As the amount of online Chinese contents grows, there is a critical need for effective Chinese word segmentation approaches to facilitate Web computing applications in a range of domains including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. The pure statistical method has lower precision, while the pure dictionary-based method cannot deal with new words beyond the dictionary. In this paper, we propose a hybrid method that is able to avoid the limitations of both types of approaches. Through the use of suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experiment results show that IASeg performs better than the benchmarks in both precision and recall for the domain-specific corpus and achieves comparable performance for the general corpus. © 2010 Springer Science+Business Media, LLC. | en_HK
dc.language | eng | en_US
dc.publisher | Springer New York LLC. The Journal's web site is located at http://springerlink.metapress.com/openurl.asp?genre=journal&issn=1387-3326 | en_HK
dc.relation.ispartof | Information Systems Frontiers | en_HK
dc.rights | The original publication is available at www.springerlink.com | -
dc.subject | Chinese segmentation | en_HK
dc.subject | Heuristic rules | en_HK
dc.subject | Mutual information | en_HK
dc.subject | N-gram | en_HK
dc.subject | Suffix tree | en_HK
dc.subject | Ukkonen algorithm | en_HK
dc.title | Domain-specific Chinese word segmentation using suffix tree and mutual information | en_HK
dc.type | Article | en_HK
dc.identifier.openurl | http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=1387-3326&volume=13&issue=1&spage=115&epage=125&date=2011&atitle=Domain-specific+Chinese+word+segmentation+using+suffix+tree+and+mutual+information | -
dc.identifier.email | Chau, M: mchau@hkucc.hku.hk | en_HK
dc.identifier.authority | Chau, M=rp01051 | en_HK
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1007/s10796-010-9278-5 | en_HK
dc.identifier.scopus | eid_2-s2.0-79952897748 | en_HK
dc.identifier.hkuros | 193032 | en_US
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-79952897748&selection=ref&src=s&origin=recordpage | en_HK
dc.identifier.volume | 13 | en_HK
dc.identifier.issue | 1 | en_HK
dc.identifier.spage | 115 | en_HK
dc.identifier.epage | 125 | en_HK
dc.identifier.isi | WOS:000288220000010 | -
dc.publisher.place | United States | en_HK
dc.identifier.scopusauthorid | Zeng, D=7102694556 | en_HK
dc.identifier.scopusauthorid | Wei, D=24472144100 | en_HK
dc.identifier.scopusauthorid | Chau, M=7006073763 | en_HK
dc.identifier.scopusauthorid | Wang, F=36549957500 | en_HK
dc.identifier.citeulike | 8127385 | -
