postgraduate thesis: Data-centric approaches for better multiple sequence alignment
Title | Data-centric approaches for better multiple sequence alignment |
---|---|
Authors | Kuang, Mengmeng (匡盟盟) |
Advisors | Ting, HF |
Issue Date | 2020 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Kuang, M. [匡盟盟]. (2020). Data-centric approaches for better multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | In this thesis, we investigated the use of a data-centric approach to tackle the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach uses models trained on existing data to guide the construction.
In our first study, we identified a simple classifier that helps choose the best alignment tool. To correct errors in the initial alignment, we then added a post-processing step, a region-centric realignment process. We also used a classifier to select the appropriate realignment strategy for each family. In our second study, we examined how to incorporate deep-learning methods into the underlying steps of the progressive alignment method. To improve the accuracy of that method, we first determined which part had the greatest potential for improvement and then trained a decision-making model for that part to guide the MSA construction process.
Accordingly, we released two complete new MSA tools, one from each study: MLProbs from the first and DLPAlign from the second. We compared them with about 10 other popular MSA tools on several commonly used empirical benchmarks. The results showed that both tools improved MSA accuracy to some degree on all tests. Furthermore, on low-similarity protein families our methods gave unexpectedly good results: MLProbs achieved a 2.9% TC-score improvement on families with PID <= 50%, while DLPAlign achieved a 2.8% TC-score improvement on families with PID <= 30%. Both methods also obtained good results in real-life applications. |
Degree | Master of Philosophy |
Subject | Sequence alignment (Bioinformatics) |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/287514 |
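
The abstract reports results as TC-score and PID without defining them. The following is a minimal sketch of how these two quantities are commonly computed, not the thesis's own evaluation code. The function names `tc_score` and `percent_identity` are hypothetical, and the conventions chosen here (reference columns with fewer than two residues are skipped, PID is taken over positions where both sequences have a residue) are assumptions; benchmark tools differ on these details.

```python
from typing import Dict, List, Tuple

GAP = "-"


def residue_columns(alignment: List[str]) -> Dict[Tuple[int, int], int]:
    """Map (sequence index, residue index) to the column holding that residue."""
    mapping: Dict[Tuple[int, int], int] = {}
    for s, row in enumerate(alignment):
        residue = 0
        for col, ch in enumerate(row):
            if ch != GAP:
                mapping[(s, residue)] = col
                residue += 1
    return mapping


def tc_score(reference: List[str], test: List[str]) -> float:
    """Fraction of scored reference columns whose residues share one test column.

    Assumes both alignments contain the same sequences in the same order.
    Assumption: reference columns with fewer than two residues are not scored.
    """
    test_cols = residue_columns(test)
    counters = [0] * len(reference)  # next residue index for each sequence
    scored = correct = 0
    for col in range(len(reference[0])):
        members = []
        for s, row in enumerate(reference):
            if row[col] != GAP:
                members.append((s, counters[s]))
                counters[s] += 1
        if len(members) < 2:
            continue
        scored += 1
        if len({test_cols[m] for m in members}) == 1:
            correct += 1
    return correct / scored if scored else 0.0


def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues among positions where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)
```

For example, with the reference `["AC-G", "ACTG"]` and the test alignment `["ACG-", "ACTG"]`, two of the three scored reference columns are reproduced, so `tc_score` returns 2/3.
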
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Ting, HF | - |
dc.contributor.author | Kuang, Mengmeng | - |
dc.contributor.author | 匡盟盟 | - |
dc.date.accessioned | 2020-10-01T04:31:57Z | - |
dc.date.available | 2020-10-01T04:31:57Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Kuang, M. [匡盟盟]. (2020). Data-centric approaches for better multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/287514 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Sequence alignment (Bioinformatics) | - |
dc.title | Data-centric approaches for better multiple sequence alignment | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2020 | - |
dc.identifier.mmsid | 991044284999003414 | - |