Domain specific text annotation : methods and applications

Yuan, Guowen; 袁国文

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Domain specific text annotation : methods and applications

Title	Domain specific text annotation : methods and applications
Authors	Yuan, Guowen 袁国文
Advisors	Kao, CM
Issue Date	2023
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Yuan, G. [袁国文]. (2023). Domain specific text annotation : methods and applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Text annotation is a crucial task in natural language processing (NLP): it extracts the features that users are interested in from text and labels them accordingly. In this thesis, we study three problems: (1) training machine annotators in low-resource scenarios; (2) cost-efficient machine-assisted text annotation; and (3) integrating the domain knowledge in existing annotations to improve document retrieval and summarization. For the first problem, we study low-resource named entity recognition (NER) with demonstration learning and tackle two key issues: demonstration construction and model training. For demonstration construction, instead of relying solely on semantic similarity to select demonstration examples, we introduce dual similarity, which combines semantic and feature similarity. For model training, we show that an NER tagger’s ability to reference demonstration examples is generally inadequate. To address this, we propose training the NER model with adversarial demonstration so that the model is forced to refer to the demonstrations when performing the tagging task. Extensive experiments show that our approach surpasses several existing methods on low-resource NER tasks. For the second problem, we tackle the challenges of cost-efficient text annotation. Annotating documents, particularly those that are lengthy, feature-rich, or domain-specific, can be both time-consuming and costly. We propose CEMA, a method for deploying machine learning to assist humans in complex document annotation. CEMA estimates the human cost of annotating each document and selects the set of documents to annotate that strikes the best balance between model accuracy and human cost. Extensive experiments demonstrate that CEMA outperforms other document selection and annotation strategies on complex annotation tasks.
Finally, online legal document libraries, such as WorldLII, are indispensable tools for legal professionals conducting legal research. In this context, we study how topic modeling techniques can be applied to such platforms to facilitate the search of court judgments. Specifically, we improve search effectiveness by matching judgments to queries at the semantic level rather than the keyword level. We also design a system that summarizes a retrieved judgment by highlighting a small number of paragraphs that are semantically most relevant to the user query or that relate to other key aspects. We further improve the system by incorporating domain knowledge: annotated text, serving as a carrier of domain knowledge, is extracted to generate more refined topics. The results of an evaluation experiment on the HKLII platform show that our methods are effective. In conclusion, our exploration of text annotation and its applications has yielded promising outcomes. Addressing the three problems outlined at the outset, we have proposed strategies for training machine annotators in low-resource scenarios, proposed CEMA for cost-efficient machine-assisted text annotation, and shown the benefits of integrating domain knowledge and topic modeling for document retrieval and summarization. A natural next step is to harness the knowledge and language understanding capabilities of large models to further refine our algorithms.
Degree	Doctor of Philosophy
Subject	Natural language processing (Computer science)
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/350248

DC Field	Value	Language
dc.contributor.advisor	Kao, CM	-
dc.contributor.author	Yuan, Guowen	-
dc.contributor.author	袁国文	-
dc.date.accessioned	2024-10-21T08:15:54Z	-
dc.date.available	2024-10-21T08:15:54Z	-
dc.date.issued	2023	-
dc.identifier.citation	Yuan, G. [袁国文]. (2023). Domain specific text annotation : methods and applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/350248	-
dc.description.abstract	Text annotation is a crucial task in natural language processing (NLP): it extracts the features that users are interested in from text and labels them accordingly. In this thesis, we study three problems: (1) training machine annotators in low-resource scenarios; (2) cost-efficient machine-assisted text annotation; and (3) integrating the domain knowledge in existing annotations to improve document retrieval and summarization. For the first problem, we study low-resource named entity recognition (NER) with demonstration learning and tackle two key issues: demonstration construction and model training. For demonstration construction, instead of relying solely on semantic similarity to select demonstration examples, we introduce dual similarity, which combines semantic and feature similarity. For model training, we show that an NER tagger’s ability to reference demonstration examples is generally inadequate. To address this, we propose training the NER model with adversarial demonstration so that the model is forced to refer to the demonstrations when performing the tagging task. Extensive experiments show that our approach surpasses several existing methods on low-resource NER tasks. For the second problem, we tackle the challenges of cost-efficient text annotation. Annotating documents, particularly those that are lengthy, feature-rich, or domain-specific, can be both time-consuming and costly. We propose CEMA, a method for deploying machine learning to assist humans in complex document annotation. CEMA estimates the human cost of annotating each document and selects the set of documents to annotate that strikes the best balance between model accuracy and human cost. Extensive experiments demonstrate that CEMA outperforms other document selection and annotation strategies on complex annotation tasks.
Finally, online legal document libraries, such as WorldLII, are indispensable tools for legal professionals conducting legal research. In this context, we study how topic modeling techniques can be applied to such platforms to facilitate the search of court judgments. Specifically, we improve search effectiveness by matching judgments to queries at the semantic level rather than the keyword level. We also design a system that summarizes a retrieved judgment by highlighting a small number of paragraphs that are semantically most relevant to the user query or that relate to other key aspects. We further improve the system by incorporating domain knowledge: annotated text, serving as a carrier of domain knowledge, is extracted to generate more refined topics. The results of an evaluation experiment on the HKLII platform show that our methods are effective. In conclusion, our exploration of text annotation and its applications has yielded promising outcomes. Addressing the three problems outlined at the outset, we have proposed strategies for training machine annotators in low-resource scenarios, proposed CEMA for cost-efficient machine-assisted text annotation, and shown the benefits of integrating domain knowledge and topic modeling for document retrieval and summarization. A natural next step is to harness the knowledge and language understanding capabilities of large models to further refine our algorithms.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Natural language processing (Computer science)	-
dc.title	Domain specific text annotation : methods and applications	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2024	-
dc.identifier.mmsid	991044745659503414	-
