postgraduate thesis: Domain specific text annotation : methods and applications

Title: Domain specific text annotation : methods and applications
Authors: Yuan, Guowen (袁国文)
Advisor(s): Kao, CM
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yuan, G. [袁国文]. (2023). Domain specific text annotation : methods and applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Text annotation is a crucial task in natural language processing (NLP), allowing us to extract features that users are interested in from the text and label them accordingly. In this context, we study three problems: (1) we focus on training machine annotators in low-resource scenarios; (2) we explore strategies for cost-efficient machine-assisted text annotation; and (3) we investigate how to integrate the domain knowledge from existing annotations to optimize document retrieval and summarization performance.

Addressing the first point, we delve into low-resource named entity recognition (NER) employing demonstration learning, tackling two key issues: demonstration construction and model training. To address the first issue, instead of relying solely on semantic similarity for selecting demonstration examples, we introduce dual similarity, incorporating both semantic and feature similarities. Concerning the second issue, we show that the NER tagger’s ability to reference demonstration examples is generally inadequate. To address this, we propose to train an NER model with adversarial demonstration such that the model is forced to refer to the demonstrations when performing the tagging task. Extensive experiments show that our approach surpasses several existing methods in low-resource NER tasks.

Secondly, we tackle the challenges associated with cost-efficient text annotation. Annotating documents, particularly those that are lengthy, feature-rich, or domain-specific, can be both time-consuming and costly. We propose CEMA, a method for deploying machine learning to assist humans in complex document annotation. CEMA estimates the human cost of annotating each document and selects the set of documents to be annotated that strikes the best balance between model accuracy and human cost. Extensive experiments demonstrate that CEMA outperforms other document selection and annotation strategies on complex annotation tasks.

Finally, online legal document libraries, such as WorldLII, are indispensable tools for legal professionals to conduct legal research. In this context, we study how topic modeling techniques can be applied to such platforms to facilitate the searching of court judgments. Specifically, we improve search effectiveness by matching judgments to queries at the semantic level rather than at the keyword level. Also, we design a system that summarizes a retrieved judgment by highlighting a small number of paragraphs that are semantically most relevant to the user query or relate to other key aspects. We further improve our system by incorporating domain knowledge. Specifically, annotated text, serving as a carrier of domain knowledge, is extracted to generate more refined topics. The results of the evaluation experiment on the HKLII platform show that our methods are highly effective.

In conclusion, our exploration into text annotation and its applications has yielded promising outcomes. Addressing the challenges outlined at the outset, we have proposed strategies for training machine annotators in low-resource scenarios, proposed CEMA for cost-effective machine-assisted text annotation, and shown the benefits of integrating domain knowledge and topic modeling for optimized document retrieval and summarization. As we advance in this field, the next frontier lies in harnessing the vast knowledge and language understanding capabilities of large models to further refine our algorithms.
Degree: Doctor of Philosophy
Subject: Natural language processing (Computer science)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/350248

 

DC Field | Value | Language
dc.contributor.advisor | Kao, CM | -
dc.contributor.author | Yuan, Guowen | -
dc.contributor.author | 袁国文 | -
dc.date.accessioned | 2024-10-21T08:15:54Z | -
dc.date.available | 2024-10-21T08:15:54Z | -
dc.date.issued | 2023 | -
dc.identifier.citation | Yuan, G. [袁国文]. (2023). Domain specific text annotation : methods and applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/350248 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Natural language processing (Computer science) | -
dc.title | Domain specific text annotation : methods and applications | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2024 | -
dc.identifier.mmsid | 991044745659503414 | -
