postgraduate thesis: Domain specific text annotation : methods and applications
Title | Domain specific text annotation : methods and applications |
---|---|
Authors | Yuan, Guowen (袁国文) |
Advisors | Kao, CM |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Yuan, G. [袁国文]. (2023). Domain specific text annotation : methods and applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Text annotation is a crucial task in natural language processing (NLP), allowing us to extract features that users are interested in from the text and label them accordingly. In this context, we study three problems: (1) training machine annotators in low-resource scenarios; (2) cost-efficient machine-assisted text annotation; and (3) integrating the domain knowledge from existing annotations to improve document retrieval and summarization performance.
Addressing the first point, we delve into low-resource named entity recognition (NER) employing demonstration learning, tackling two key issues: demonstration construction and model training. To address the first issue, instead of solely relying on semantic similarity for selecting demonstration examples, we introduce dual similarity, incorporating both semantic and feature similarities. Concerning the second issue, we show that the NER tagger’s ability to reference demonstration examples is generally inadequate. To address this, we propose to train an NER model with adversarial demonstration such that the model is forced to refer to the demonstrations when performing the tagging task. Extensive experiments show our approach surpasses several existing methods in low-resource NER tasks.
Secondly, we tackle the challenges associated with cost-efficient text annotation. Annotating documents, particularly those that are lengthy, feature-rich, or domain-specific, can be both time-consuming and costly. We propose CEMA, a method for deploying machine learning to assist humans in complex document annotation. CEMA estimates the human cost of annotating each document and selects the set of documents to be annotated that strikes the best balance between model accuracy and human cost. Extensive experiments demonstrate that CEMA outperforms other document selection and annotation strategies on complex annotation tasks.
Finally, online legal document libraries, such as WorldLII, are indispensable tools for legal professionals conducting legal research. In this context, we study how topic modeling techniques can be applied to such platforms to facilitate searching of court judgments. Specifically, we improve search effectiveness by matching judgments to queries at the semantic level rather than at the keyword level. Also, we design a system that summarizes a retrieved judgment by highlighting a small number of paragraphs that are semantically most relevant to the user query or relate to other key aspects. We further improve our system by incorporating domain knowledge. Specifically, annotated text, serving as a carrier of domain knowledge, is extracted to generate more refined topics. The results of the evaluation experiment on the HKLII platform show that our methods are highly effective.
In conclusion, our exploration into text annotation and its applications has yielded promising outcomes. Addressing the challenges outlined at the outset, we have successfully proposed innovative strategies for training machine annotators in low-resource scenarios, proposed CEMA for cost-effective machine-assisted text annotation, and illuminated the profound benefits of integrating domain knowledge and topic modeling for optimized document retrieval and summarization. As we advance in this field, the next frontier lies in harnessing the vast knowledge and formidable language understanding capabilities of large models to further refine our algorithms. |
Degree | Doctor of Philosophy |
Subject | Natural language processing (Computer science) |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/350248 |
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Kao, CM | - |
dc.contributor.author | Yuan, Guowen | - |
dc.contributor.author | 袁国文 | - |
dc.date.accessioned | 2024-10-21T08:15:54Z | - |
dc.date.available | 2024-10-21T08:15:54Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Yuan, G. [袁国文]. (2023). Domain specific text annotation : methods and applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/350248 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Natural language processing (Computer science) | - |
dc.title | Domain specific text annotation : methods and applications | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044745659503414 | - |