Domain-specific Chinese word segmentation using suffix tree and mutual information

Zeng, D; Wei, D; Chau, M; Wang, F

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1007/s10796-010-9278-5
Scopus: eid_2-s2.0-79952897748
WOS: WOS:000288220000010
Supplementary

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Faculty of Business & Economics: Journal/Magazine Articles

Title

Domain-specific Chinese word segmentation using suffix tree and mutual information

Authors

Zeng, D; Wei, D; Chau, M; Wang, F

Keywords

Chinese segmentation
Heuristic rules
Mutual information
N-gram
Suffix tree
Ukkonen algorithm

Issue Date

2011

Publisher

Springer New York LLC. The Journal's web site is located at http://springerlink.metapress.com/openurl.asp?genre=journal&issn=1387-3326

Citation

Information Systems Frontiers, 2011, v. 13 n. 1, p. 115-125

DOI: http://dx.doi.org/10.1007/s10796-010-9278-5

Abstract

As the amount of online Chinese content grows, there is a critical need for effective Chinese word segmentation approaches to facilitate Web computing applications in a range of domains, including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. The pure statistical method has lower precision, while the pure dictionary-based method cannot deal with new words beyond the dictionary. In this paper, we propose a hybrid method that is able to avoid the limitations of both types of approaches. Through the use of a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method, IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experimental results show that IASeg performs better than the benchmarks in both precision and recall for the domain-specific corpus and achieves comparable performance for the general corpus. © 2010 Springer Science+Business Media, LLC.
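
The abstract mentions MI-based token merging without giving a formula. As a rough illustration only, the sketch below uses the standard pointwise mutual information definition, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), estimated from raw counts, and greedily merges adjacent tokens whose score exceeds a threshold. The function names `adjacent_pmi` and `merge_by_pmi`, the relative-frequency estimates, and the threshold value are assumptions made for this example; they are not taken from the paper and do not describe how IASeg is actually implemented.

```python
import math
from collections import Counter


def adjacent_pmi(tokens):
    """Return pointwise mutual information for each adjacent token pair.

    PMI(x, y) = log2(P(x, y) / (P(x) * P(y))). Probabilities are plain
    relative frequencies, which is an assumption of this sketch.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def merge_by_pmi(tokens, threshold):
    """Greedily merge adjacent tokens whose PMI exceeds `threshold`.

    The left-to-right greedy pass and the threshold are hypothetical
    choices for illustration, not the paper's procedure.
    """
    scores = adjacent_pmi(tokens)
    merged = []
    i = 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair is not None and scores[pair] > threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip the token that was absorbed into the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    # Toy single-character tokens; "a" is always followed by "b".
    print(merge_by_pmi(list("abcabcccab"), threshold=1.0))
```

In this toy input the pair ("a", "b") co-occurs far more often than chance would predict, so it is the only pair merged at a threshold of 1.0. A real segmenter would estimate the counts from a large domain corpus rather than from the sentence being segmented.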

Persistent Identifier

http://hdl.handle.net/10722/139819

ISSN

1387-3326

2023 Impact Factor: 6.9

2023 SCImago Journal Rankings: 1.577

ISI Accession Number ID

WOS:000288220000010

Funding Agency	Grant Number
NNSFC	90924302 60921061
MOST	2006AA010106
CAS	2F07C01
NSF	IIS-0428241
HKU	10207565

Funding Information:

The reported work was supported in part by the following grants: NNSFC #90924302 and #60921061, MOST #2006AA010106, CAS #2F07C01, NSF #IIS-0428241, and HKU #10207565. We thank our team member Mr. Qingyang Xu for his help with the experiments. We also thank Ms. Fenglin Li and Ms. Shufang Tang for their help with data preparation and processing.

DC Field	Value	Language
dc.contributor.author	Zeng, D	en_HK
dc.contributor.author	Wei, D	en_HK
dc.contributor.author	Chau, M	en_HK
dc.contributor.author	Wang, F	en_HK
dc.date.accessioned	2011-09-23T05:56:54Z	-
dc.date.available	2011-09-23T05:56:54Z	-
dc.date.issued	2011	en_HK
dc.identifier.citation	Information Systems Frontiers, 2011, v. 13 n. 1, p. 115-125	en_HK
dc.identifier.issn	1387-3326	en_HK
dc.identifier.uri	http://hdl.handle.net/10722/139819	-
dc.description.abstract	As the amount of online Chinese content grows, there is a critical need for effective Chinese word segmentation approaches to facilitate Web computing applications in a range of domains, including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. The pure statistical method has lower precision, while the pure dictionary-based method cannot deal with new words beyond the dictionary. In this paper, we propose a hybrid method that is able to avoid the limitations of both types of approaches. Through the use of a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method, IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experimental results show that IASeg performs better than the benchmarks in both precision and recall for the domain-specific corpus and achieves comparable performance for the general corpus. © 2010 Springer Science+Business Media, LLC.	en_HK
dc.language	eng	en_US
dc.publisher	Springer New York LLC. The Journal's web site is located at http://springerlink.metapress.com/openurl.asp?genre=journal&issn=1387-3326	en_HK
dc.relation.ispartof	Information Systems Frontiers	en_HK
dc.rights	The original publication is available at www.springerlink.com	-
dc.subject	Chinese segmentation	en_HK
dc.subject	Heuristic rules	en_HK
dc.subject	Mutual information	en_HK
dc.subject	N-gram	en_HK
dc.subject	Suffix tree	en_HK
dc.subject	Ukkonen algorithm	en_HK
dc.title	Domain-specific Chinese word segmentation using suffix tree and mutual information	en_HK
dc.type	Article	en_HK
dc.identifier.openurl	http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=1387-3326&volume=13&issue=1&spage=115&epage=125&date=2011&atitle=Domain-specific+Chinese+word+segmentation+using+suffix+tree+and+mutual+information	-
dc.identifier.email	Chau, M: mchau@hkucc.hku.hk	en_HK
dc.identifier.authority	Chau, M=rp01051	en_HK
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1007/s10796-010-9278-5	en_HK
dc.identifier.scopus	eid_2-s2.0-79952897748	en_HK
dc.identifier.hkuros	193032	en_US
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-79952897748&selection=ref&src=s&origin=recordpage	en_HK
dc.identifier.volume	13	en_HK
dc.identifier.issue	1	en_HK
dc.identifier.spage	115	en_HK
dc.identifier.epage	125	en_HK
dc.identifier.isi	WOS:000288220000010	-
dc.publisher.place	United States	en_HK
dc.identifier.scopusauthorid	Zeng, D=7102694556	en_HK
dc.identifier.scopusauthorid	Wei, D=24472144100	en_HK
dc.identifier.scopusauthorid	Chau, M=7006073763	en_HK
dc.identifier.scopusauthorid	Wang, F=36549957500	en_HK
dc.identifier.citeulike	8127385	-
dc.identifier.issnl	1387-3326	-
