
postgraduate thesis: Data-centric approaches for better multiple sequence alignment

Title: Data-centric approaches for better multiple sequence alignment
Authors: Kuang, Mengmeng [匡盟盟]
Advisor(s): Ting, HF
Issue Date: 2020
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Kuang, M. [匡盟盟]. (2020). Data-centric approaches for better multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: In this thesis, we investigated data-centric approaches to the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach uses models trained on existing data to guide the construction. In our first study, we identified a simple classifier that helps choose the best alignment tool. To correct errors in the original alignment, we then added a post-processing step, a region-centric realignment. We also applied a classifier to select the appropriate realignment strategy for each family. In our second study, we examined how to incorporate deep-learning methods into the underlying steps of the progressive alignment method. To improve its accuracy, we first determined which step had the most room for improvement and then trained a decision-making model for that step to guide the MSA construction process. We released two complete new MSA tools based on the two studies: MLProbs from the first and DLPAlign from the second. We compared them with about 10 other popular MSA tools on several commonly used empirical benchmarks. Both tools improved MSA accuracy on all tests, and they produced unexpectedly good results on low-similarity protein families: MLProbs achieved a 2.9% TC-score improvement on families with PID <= 50%, while DLPAlign achieved a 2.8% TC-score improvement on families with PID <= 30%. Moreover, both methods can obtain good results in real-life applications.
Degree: Master of Philosophy
Subject: Sequence alignment (Bioinformatics)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/287514

 

| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.advisor | Ting, HF | - |
| dc.contributor.author | Kuang, Mengmeng | - |
| dc.contributor.author | 匡盟盟 | - |
| dc.date.accessioned | 2020-10-01T04:31:57Z | - |
| dc.date.available | 2020-10-01T04:31:57Z | - |
| dc.date.issued | 2020 | - |
| dc.identifier.citation | Kuang, M. [匡盟盟]. (2020). Data-centric approaches for better multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/287514 | - |
| dc.description.abstract | In this thesis, we investigated data-centric approaches to the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach uses models trained on existing data to guide the construction. In our first study, we identified a simple classifier that helps choose the best alignment tool. To correct errors in the original alignment, we then added a post-processing step, a region-centric realignment. We also applied a classifier to select the appropriate realignment strategy for each family. In our second study, we examined how to incorporate deep-learning methods into the underlying steps of the progressive alignment method. To improve its accuracy, we first determined which step had the most room for improvement and then trained a decision-making model for that step to guide the MSA construction process. We released two complete new MSA tools based on the two studies: MLProbs from the first and DLPAlign from the second. We compared them with about 10 other popular MSA tools on several commonly used empirical benchmarks. Both tools improved MSA accuracy on all tests, and they produced unexpectedly good results on low-similarity protein families: MLProbs achieved a 2.9% TC-score improvement on families with PID <= 50%, while DLPAlign achieved a 2.8% TC-score improvement on families with PID <= 30%. Moreover, both methods can obtain good results in real-life applications. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Sequence alignment (Bioinformatics) | - |
| dc.title | Data-centric approaches for better multiple sequence alignment | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Master of Philosophy | - |
| dc.description.thesislevel | Master | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2020 | - |
| dc.identifier.mmsid | 991044284999003414 | - |
