Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1017/S1351324918000463
- Scopus: eid_2-s2.0-85061363509
- WOS: WOS:000462866100002
Article: Out-domain Chinese new word detection with statistics-based character embedding
Title | Out-domain Chinese new word detection with statistics-based character embedding |
---|---|
Authors | LIANG, Y; YANG, M; Zhu, J; Yiu, SM |
Keywords | Chinese character embedding Chinese new word detection Chinese word boundary detection |
Issue Date | 2019 |
Publisher | Cambridge University Press. The Journal's web site is located at http://journals.cambridge.org/action/displayJournal?jid=NLE |
Citation | Natural Language Engineering, 2019, v. 25 n. 2, p. 239-255 |
Abstract | Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words by space. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be considered as a part-of-speech (POS)-tagging problem. We can segment corpus by assigning a label for each character which indicates the position of the character in a word (e.g., “B” for word beginning, and “E” for the end of the word, etc.). Chinese word segmentation seems to be well studied. Machine learning models such as conditional random field (CRF) and bi-directional long short-term memory (LSTM) have shown outstanding performances on this task. However, the segmentation accuracies drop significantly when applying the same approaches to out-domain cases, in which high-quality in-domain training data are not available. An example of out-domain applications is the new word detection in Chinese microblogs for which the availability of high-quality corpus is limited. In this paper, we focus on out-domain Chinese new word detection. We first design a new method Edge Likelihood (EL) for Chinese word boundary detection. Then we propose a domain-independent Chinese new word detector (DICND); each Chinese character is represented as a low-dimensional vector in the proposed framework, and segmentation-related features of the character are used as the values in the vector. |
Persistent Identifier | http://hdl.handle.net/10722/277570 |
ISSN | 1351-3249 (2023 Impact Factor: 2.3; 2023 SCImago Journal Rankings: 0.664) |
ISI Accession Number ID | WOS:000462866100002 |
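
The abstract describes segmentation as assigning each character a label for its position in a word. A minimal Python sketch of that labeling scheme follows. It is an illustration only, not the authors' EL or DICND method. The abstract names only "B" and "E"; the "M" (middle) and "S" (single-character word) tags, the function names, and the sample sentence are assumptions added here.

```python
def words_to_tags(words):
    """Turn a segmented sentence into one position tag per character.

    Assumed tag set: B (word beginning), M (word middle),
    E (word end), S (single-character word).
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their position tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            # A new word starts before the previous one was closed; flush it.
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            # The word is complete at this character.
            words.append(current)
            current = ""
    if current:
        # Keep any trailing characters that never received an end tag.
        words.append(current)
    return words


# Hypothetical example sentence, already segmented by hand.
segmented = ["我", "喜欢", "自然语言"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

A sequence tagger such as a CRF or bi-directional LSTM would predict the tag list from the raw characters; the second function then converts the predicted tags back into words.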
DC Field | Value | Language |
---|---|---|
dc.contributor.author | LIANG, Y | - |
dc.contributor.author | YANG, M | - |
dc.contributor.author | Zhu, J | - |
dc.contributor.author | Yiu, SM | - |
dc.date.accessioned | 2019-09-20T08:53:34Z | - |
dc.date.available | 2019-09-20T08:53:34Z | - |
dc.date.issued | 2019 | - |
dc.identifier.citation | Natural Language Engineering, 2019, v. 25 n. 2, p. 239-255 | - |
dc.identifier.issn | 1351-3249 | - |
dc.identifier.uri | http://hdl.handle.net/10722/277570 | - |
dc.language | eng | - |
dc.publisher | Cambridge University Press. The Journal's web site is located at http://journals.cambridge.org/action/displayJournal?jid=NLE | - |
dc.relation.ispartof | Natural Language Engineering | - |
dc.rights | Natural Language Engineering. Copyright © Cambridge University Press. | - |
dc.rights | This article has been published in a revised form in [Journal] [http://doi.org/XXX]. This version is free to view and download for private research and study only. Not for re-distribution, re-sale or use in derivative works. © copyright holder. | - |
dc.subject | Chinese character embedding | - |
dc.subject | Chinese new word detection | - |
dc.subject | Chinese word boundary detection | - |
dc.title | Out-domain Chinese new word detection with statistics-based character embedding | - |
dc.type | Article | - |
dc.identifier.email | Yiu, SM: smyiu@cs.hku.hk | - |
dc.identifier.authority | Yiu, SM=rp00207 | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1017/S1351324918000463 | - |
dc.identifier.scopus | eid_2-s2.0-85061363509 | - |
dc.identifier.hkuros | 305931 | - |
dc.identifier.volume | 25 | - |
dc.identifier.issue | 2 | - |
dc.identifier.spage | 239 | - |
dc.identifier.epage | 255 | - |
dc.identifier.isi | WOS:000462866100002 | - |
dc.publisher.place | United Kingdom | - |
dc.identifier.issnl | 1351-3249 | - |