Accurate text and data mining methods for biomedical knowledge discovery

吴晔; Wu, Ye

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Accurate text and data mining methods for biomedical knowledge discovery

Title	Accurate text and data mining methods for biomedical knowledge discovery
Authors	吴晔; Wu, Ye
Advisors	Ting, HF; Lam, TW
Issue Date	2020
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	吴晔, [Wu, Ye]. (2020). Accurate text and data mining methods for biomedical knowledge discovery. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Biomedical literature and high-throughput sequencing data are important sources of knowledge for developing better diagnoses and treatments. However, their high complexity and large volume pose significant challenges for knowledge discovery and translation. This thesis introduces three novel biomedical text and data mining methods for automatic and accurate knowledge discovery. Each was benchmarked in comprehensive experiments to demonstrate its flexibility and its advantage over existing methods.
The first method is RENET, a deep learning approach for gene-disease relation extraction from biomedical literature. Existing text-mining tools for extracting gene-disease associations have limited capacity because they consider each sentence separately. Our experiments show that the best existing tools, such as BeFree and DTMiner, achieve at most 48% precision and 78% recall. RENET instead considers the correlation between the sentences in an article when extracting gene-disease associations, which significantly improves precision and recall to 85.2% and 81.8%, respectively.
The second method is BioNumQA-BERT, a deep language representation model that uses numerical facts for biomedical question answering (QA). Current biomedical QA methods have limited capacity because they commonly neglect the role of numerical facts. BioNumQA-BERT introduces a novel numerical encoding scheme into the popular biomedical language model BioBERT to represent the numerical values in the input text. Our experiments show that BioNumQA-BERT significantly outperformed other state-of-the-art models, including DrQA, BERT, and BioBERT (39.0% vs 29.5%, 31.3%, and 33.2% strict accuracy, respectively). To improve its generalization ability, we further pretrained BioNumQA-BERT on a large biomedical text corpus and achieved 41.5% strict accuracy.
The third method is Translocator, a translocation detection method for single-molecule long-read sequencing data. Recent single-molecule sequencing technologies that produce long reads promise more accurate translocation detection. However, existing tools struggle with the high base error rate of long reads, and identifying the correct translocation breakpoints is especially challenging because of suboptimally aligned reads. To address this problem, we developed Translocator, a robust and accurate translocation detection method that implements an effective realignment algorithm to recover the correct alignments. For benchmarking, we analyzed NA12878 long reads against a modified GRCh38 reference genome embedded with translocations at known locations. Our results show that Translocator significantly outperformed other state-of-the-art methods, including Sniffles and PBSV. On Oxford Nanopore data, recall improved from 48.2% to 87.5% and precision from 88.7% to 92.7%.
Degree	Doctor of Philosophy
Subject	Medical informatics; Data mining
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/302560
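
The Translocator benchmark described in the abstract compares detected translocations with ones embedded at known locations. As a rough illustration of how such a comparison might be scored, the Python sketch below matches called breakpoints to a truth set within a distance tolerance and reports precision and recall. The `Breakpoint` type, the `matches` and `score_calls` functions, the 100 bp tolerance, and the example coordinates are all hypothetical; this record does not describe the evaluation protocol actually used in the thesis.

```python
from typing import NamedTuple


class Breakpoint(NamedTuple):
    """One translocation junction: two chromosome/position ends (hypothetical type)."""
    chrom_a: str
    pos_a: int
    chrom_b: str
    pos_b: int


def matches(call: Breakpoint, truth: Breakpoint, tolerance: int) -> bool:
    """True when both ends of the call lie within `tolerance` bp of the truth."""
    return (
        call.chrom_a == truth.chrom_a
        and call.chrom_b == truth.chrom_b
        and abs(call.pos_a - truth.pos_a) <= tolerance
        and abs(call.pos_b - truth.pos_b) <= tolerance
    )


def score_calls(calls, truths, tolerance=100):
    """Return (precision, recall); each known breakpoint matches at most one call."""
    unmatched = list(truths)
    true_positives = 0
    for call in calls:
        for truth in unmatched:
            if matches(call, truth, tolerance):
                unmatched.remove(truth)  # safe: we break out right after removing
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


# Toy data (assumed values, not taken from the thesis).
truths = [
    Breakpoint("chr1", 1_000_000, "chr5", 2_000_000),
    Breakpoint("chr2", 500_000, "chr9", 750_000),
]
calls = [
    Breakpoint("chr1", 1_000_040, "chr5", 1_999_980),  # within tolerance of truth 1
    Breakpoint("chr3", 10_000, "chr7", 20_000),        # matches nothing
]
precision, recall = score_calls(calls, truths)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With this toy data, one of the two calls matches one of the two known breakpoints, so precision and recall are both 0.5. A fuller comparison would also need to handle calls that report the two ends in the opposite order.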

DC Field	Value	Language
dc.contributor.advisor	Ting, HF	-
dc.contributor.advisor	Lam, TW	-
dc.contributor.author	吴晔	-
dc.contributor.author	Wu, Ye	-
dc.date.accessioned	2021-09-07T03:41:27Z	-
dc.date.available	2021-09-07T03:41:27Z	-
dc.date.issued	2020	-
dc.identifier.citation	吴晔, [Wu, Ye]. (2020). Accurate text and data mining methods for biomedical knowledge discovery. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/302560	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Medical informatics	-
dc.subject.lcsh	Data mining	-
dc.title	Accurate text and data mining methods for biomedical knowledge discovery	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2021	-
dc.identifier.mmsid	991044410250103414	-
