
postgraduate thesis: Accurate text and data mining methods for biomedical knowledge discovery

Title: Accurate text and data mining methods for biomedical knowledge discovery
Authors: 吴晔 (Wu, Ye)
Advisors: Ting, HF; Lam, TW
Issue Date: 2020
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: 吴晔, [Wu, Ye]. (2020). Accurate text and data mining methods for biomedical knowledge discovery. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Biomedical literature and high-throughput sequencing data are important sources of knowledge for developing better diagnoses and treatments. However, the high complexity and large volume of both pose significant challenges for knowledge discovery and translation. This thesis introduces three novel biomedical text and data mining methods for automatic and accurate knowledge discovery. Each was benchmarked in comprehensive experiments to show its flexibility and its advantage over existing methods.

The first method is RENET, a deep learning approach for gene-disease relation extraction from biomedical literature. Existing text-mining tools for extracting gene-disease associations have limited capacity because they consider each sentence separately. Our experiments show that the best existing tools, such as BeFree and DTMiner, achieve at most 48% precision and 78% recall. RENET instead considers the correlation between the sentences in an article when extracting gene-disease associations. It improves precision and recall to 85.2% and 81.8%, respectively.

The second method is BioNumQA-BERT, a deep language representation model that uses numerical facts for biomedical question answering (QA). Current biomedical QA methods have limited capacity because they commonly neglect the role of numerical facts. BioNumQA-BERT introduces a novel numerical encoding scheme into the popular biomedical language model BioBERT to represent the numerical values in the input text. Our experiments show that BioNumQA-BERT significantly outperformed other state-of-the-art models, including DrQA, BERT, and BioBERT (39.0% vs 29.5%, 31.3%, and 33.2%, respectively, in strict accuracy). To improve the generalization ability of BioNumQA-BERT, we further pretrained it on a large biomedical text corpus and achieved 41.5% strict accuracy.

The third method is Translocator, a translocation detection method for single-molecule long-read sequencing data. Recent single-molecule sequencing technologies that produce long reads promise more accurate translocation detection. However, existing tools struggle with the high base error rate of long reads, and identifying the correct translocation breakpoints is especially challenging because reads are often suboptimally aligned. To address this problem, we developed Translocator, a robust and accurate translocation detection method that implements an effective realignment algorithm to recover the correct alignments. For benchmarking, we aligned NA12878 long reads against a modified GRCh38 reference genome embedded with translocations at known locations. Our results show that Translocator significantly outperformed other state-of-the-art methods, including Sniffles and PBSV. On Oxford Nanopore data, recall improved from 48.2% to 87.5% and precision from 88.7% to 92.7%.
Degree: Doctor of Philosophy
Subject: Medical informatics; Data mining
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/302560
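
The abstract reports precision and recall separately. As a reading aid, the minimal sketch below combines those reported pairs into an F1 score (the harmonic mean of precision and recall). The F1 values are derived here by arithmetic and are not reported in the abstract; the function name `f1_score` and the dictionary labels are illustrative assumptions, not part of the thesis.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs quoted in the abstract, as fractions.
# The labels are illustrative; F1 itself is NOT a figure from the abstract.
reported = {
    "RENET": (0.852, 0.818),
    "Translocator (Oxford Nanopore)": (0.927, 0.875),
    "Compared methods (Oxford Nanopore)": (0.887, 0.482),
}

derived_f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, value in derived_f1.items():
    print(f"{name}: derived F1 = {value:.3f}")
```

The harmonic mean penalizes an imbalance between the two metrics, which is why a method with high precision but low recall still receives a modest combined score.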

 

DC Field | Value | Language
dc.contributor.advisor | Ting, HF | -
dc.contributor.advisor | Lam, TW | -
dc.contributor.author | 吴晔 | -
dc.contributor.author | Wu, Ye | -
dc.date.accessioned | 2021-09-07T03:41:27Z | -
dc.date.available | 2021-09-07T03:41:27Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | 吴晔, [Wu, Ye]. (2020). Accurate text and data mining methods for biomedical knowledge discovery. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/302560 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Medical informatics | -
dc.subject.lcsh | Data mining | -
dc.title | Accurate text and data mining methods for biomedical knowledge discovery | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2021 | -
dc.identifier.mmsid | 991044410250103414 | -
