postgraduate thesis: Accurate text and data mining methods for biomedical knowledge discovery
Title | Accurate text and data mining methods for biomedical knowledge discovery |
---|---|
Authors | Wu, Ye (吴晔) |
Advisors | Ting, HF; Lam, TW |
Issue Date | 2020 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | 吴晔, [Wu, Ye]. (2020). Accurate text and data mining methods for biomedical knowledge discovery. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Biomedical literature and high-throughput sequencing data are important sources of knowledge for developing better diagnoses and treatments. However, the high complexity and large volume of biomedical literature and sequencing data pose significant challenges for knowledge discovery and translation. This thesis introduces three novel biomedical text and data mining methods for automatic and accurate knowledge discovery. All three were benchmarked in comprehensive experiments to show their flexibility and advantages over existing methods.
The first method is RENET, a deep learning approach for gene-disease relation extraction from biomedical literature. Existing text-mining tools for extracting gene-disease associations have limited capacity because they consider each sentence separately. Our experiments show that the best existing tools, such as BeFree and DTMiner, achieve at most 48% precision and 78% recall. In this study, we designed and implemented RENET, a deep learning approach that considers the correlation between the sentences in an article when extracting gene-disease associations. RENET significantly improves precision and recall to 85.2% and 81.8%, respectively.
The second method is BioNumQA-BERT, a deep language representation model that uses numerical facts for biomedical question answering (QA). Current biomedical QA methods have limited capacity because they commonly neglect the role of numerical facts. We designed BioNumQA-BERT by introducing a novel numerical encoding scheme into the popular biomedical language model BioBERT to represent the numerical values in the input text. Our experiments show that BioNumQA-BERT significantly outperformed other state-of-the-art models, including DrQA, BERT, and BioBERT (39.0% vs 29.5%, 31.3%, and 33.2%, respectively, in strict accuracy). To improve the generalization ability of BioNumQA-BERT, we further pretrained it on a large biomedical text corpus and achieved 41.5% strict accuracy.
The third method is Translocator, a translocation detection method for single-molecule long-read sequencing data. Recent developments in single-molecule sequencing technologies that produce long reads promise an advance in detecting translocations accurately. However, existing tools struggle with the high base error rate of long reads. Identifying the correct translocation breakpoints is especially challenging because of suboptimally aligned reads. To address this problem, we developed Translocator, a robust and accurate translocation detection method that implements an effective realignment algorithm to recover the correct alignments. For benchmarking, we analyzed NA12878 long reads against a modified GRCh38 reference genome embedded with translocations at known locations. Our results show that Translocator significantly outperformed other state-of-the-art methods, including Sniffles and PBSV. On Oxford Nanopore data, recall improved from 48.2% to 87.5% and precision from 88.7% to 92.7%. |
Degree | Doctor of Philosophy |
Subject | Medical informatics; Data mining |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/302560 |
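
The abstract reports precision and recall separately for RENET and Translocator. As a reading aid, the two can be combined into a single F1 score (the harmonic mean of precision and recall). The sketch below is illustrative only: the function name `f1_score` and the `reported` dictionary are hypothetical, and the resulting F1 values are derived here from the quoted figures, not reported in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs quoted in the abstract, converted to fractions.
# The F1 values computed from them are derived, not stated by the author.
reported = {
    "RENET": (0.852, 0.818),
    "Translocator (Oxford Nanopore)": (0.927, 0.875),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

The baseline figures in the abstract are given as best-case values across several tools, so the precision and recall quoted for the baselines may not come from the same tool; for that reason no baseline F1 is computed here.
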
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Ting, HF | - |
dc.contributor.advisor | Lam, TW | - |
dc.contributor.author | 吴晔 | - |
dc.contributor.author | Wu, Ye | - |
dc.date.accessioned | 2021-09-07T03:41:27Z | - |
dc.date.available | 2021-09-07T03:41:27Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | 吴晔, [Wu, Ye]. (2020). Accurate text and data mining methods for biomedical knowledge discovery. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/302560 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Medical informatics | - |
dc.subject.lcsh | Data mining | - |
dc.title | Accurate text and data mining methods for biomedical knowledge discovery | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2021 | - |
dc.identifier.mmsid | 991044410250103414 | - |