Out-domain Chinese new word detection with statistics-based character embedding

LIANG, Y; YANG, M; Zhu, J; Yiu, SM

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1017/S1351324918000463
Scopus: eid_2-s2.0-85061363509
WOS: WOS:000462866100002

Supplementary

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Computer Science: Journal/Magazine Articles

Title	Out-domain Chinese new word detection with statistics-based character embedding
Authors	LIANG, Y; YANG, M; Zhu, J; Yiu, SM
Keywords	Chinese character embedding; Chinese new word detection; Chinese word boundary detection
Issue Date	2019
Publisher	Cambridge University Press. The Journal's web site is located at http://journals.cambridge.org/action/displayJournal?jid=NLE
Citation	Natural Language Engineering, 2019, v. 25 n. 2, p. 239-255. DOI: http://dx.doi.org/10.1017/S1351324918000463
Abstract	Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words with spaces. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be considered a part-of-speech (POS)-tagging problem: we can segment a corpus by assigning each character a label that indicates the position of the character in a word (e.g., “B” for the beginning of a word, “E” for the end of a word, etc.). Chinese word segmentation seems to be well studied. Machine learning models such as conditional random fields (CRF) and bi-directional long short-term memory (LSTM) networks have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-domain cases, in which high-quality in-domain training data are not available. One example of an out-domain application is new word detection in Chinese microblogs, for which high-quality corpora are scarce. In this paper, we focus on out-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. We then propose a domain-independent Chinese new word detector (DICND); in the proposed framework, each Chinese character is represented as a low-dimensional vector whose values are segmentation-related features of the character.
Persistent Identifier	http://hdl.handle.net/10722/277570
ISSN	1351-3249 (2023 Impact Factor: 2.3; 2023 SCImago Journal Rankings: 0.664)
ISI Accession Number ID	WOS:000462866100002
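
The abstract describes segmentation as labelling each character with its position in a word. The sketch below illustrates that idea only; it is not the paper's EL method or the DICND model. The abstract names just the “B” and “E” labels, so the “M” (middle) and “S” (single-character word) labels, the function names, and the sample sentence are assumptions borrowed from the commonly used four-tag scheme.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map every character of every word to a position tag.

    B = word beginning, E = word end (both named in the abstract);
    M = word middle, S = single-character word (assumed here).
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(word) - 2))  # one M per interior character
            tags.append("E")
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and tags; a word closes on E or S."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)  # tolerate a sequence that ends mid-word
    return words


if __name__ == "__main__":
    segmented = ["我", "喜欢", "自然语言", "处理"]  # hypothetical example sentence
    tags = words_to_tags(segmented)
    print(tags)
    print(tags_to_words("".join(segmented), tags))
```

A tagger such as a CRF or bi-directional LSTM would predict the tag sequence from the raw characters; `tags_to_words` then turns those predictions back into a segmentation.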

DC Field	Value	Language
dc.contributor.author	LIANG, Y	-
dc.contributor.author	YANG, M	-
dc.contributor.author	Zhu, J	-
dc.contributor.author	Yiu, SM	-
dc.date.accessioned	2019-09-20T08:53:34Z	-
dc.date.available	2019-09-20T08:53:34Z	-
dc.date.issued	2019	-
dc.identifier.citation	Natural Language Engineering, 2019, v. 25 n. 2, p. 239-255	-
dc.identifier.issn	1351-3249	-
dc.identifier.uri	http://hdl.handle.net/10722/277570	-
dc.description.abstract	Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words with spaces. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be considered a part-of-speech (POS)-tagging problem: we can segment a corpus by assigning each character a label that indicates the position of the character in a word (e.g., “B” for the beginning of a word, “E” for the end of a word, etc.). Chinese word segmentation seems to be well studied. Machine learning models such as conditional random fields (CRF) and bi-directional long short-term memory (LSTM) networks have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-domain cases, in which high-quality in-domain training data are not available. One example of an out-domain application is new word detection in Chinese microblogs, for which high-quality corpora are scarce. In this paper, we focus on out-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. We then propose a domain-independent Chinese new word detector (DICND); in the proposed framework, each Chinese character is represented as a low-dimensional vector whose values are segmentation-related features of the character.	-
dc.language	eng	-
dc.publisher	Cambridge University Press. The Journal's web site is located at http://journals.cambridge.org/action/displayJournal?jid=NLE	-
dc.relation.ispartof	Natural Language Engineering	-
dc.rights	Natural Language Engineering. Copyright © Cambridge University Press.	-
dc.rights	This article has been published in a revised form in [Journal] [http://doi.org/XXX]. This version is free to view and download for private research and study only. Not for re-distribution, re-sale or use in derivative works. © copyright holder.	-
dc.subject	Chinese character embedding	-
dc.subject	Chinese new word detection	-
dc.subject	Chinese word boundary detection	-
dc.title	Out-domain Chinese new word detection with statistics-based character embedding	-
dc.type	Article	-
dc.identifier.email	Yiu, SM: smyiu@cs.hku.hk	-
dc.identifier.authority	Yiu, SM=rp00207	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1017/S1351324918000463	-
dc.identifier.scopus	eid_2-s2.0-85061363509	-
dc.identifier.hkuros	305931	-
dc.identifier.volume	25	-
dc.identifier.issue	2	-
dc.identifier.spage	239	-
dc.identifier.epage	255	-
dc.identifier.isi	WOS:000462866100002	-
dc.publisher.place	United Kingdom	-
dc.identifier.issnl	1351-3249	-