Data-centric approaches for better multiple sequence alignment

Kuang, Mengmeng; 匡盟盟

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Data-centric approaches for better multiple sequence alignment

Title	Data-centric approaches for better multiple sequence alignment
Authors	Kuang, Mengmeng 匡盟盟
Advisors	Ting, HF
Issue Date	2020
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Kuang, M. [匡盟盟]. (2020). Data-centric approaches for better multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	In this thesis, we investigated data-centric approaches to the problem of constructing Multiple Sequence Alignments (MSAs). Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach uses models trained on existing data to guide the construction. In our first study, we identified a simple classifier that helps choose the best alignment tool. To correct errors in the initial alignment, we then added a post-processing step that realigns selected regions, and we used a classifier to select an appropriate realignment strategy for each family. In our second study, we examined how deep-learning methods can be incorporated into the underlying steps of the progressive alignment method. To improve the accuracy of that method, we first determined which part had the greatest potential for improvement and then trained a decision-making model for that part to guide the MSA construction process. We released two complete new MSA tools based on the two studies: MLProbs from the first study and DLPAlign from the second. We compared them with about 10 other popular MSA tools on several commonly used empirical benchmarks. The results showed that both tools improved MSA accuracy to some extent on all tests. Furthermore, on low-similarity protein families, our methods gave unexpectedly good results: MLProbs achieved a 2.9% TC-score improvement on families with PID <= 50%, while DLPAlign achieved a 2.8% TC-score improvement on families with PID <= 30%. Moreover, both new MSA methods obtained good results in real-life applications.
Degree	Master of Philosophy
Subject	Sequence alignment (Bioinformatics)
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/287514
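
The abstract reports its results as TC-score improvements on families grouped by PID (percent identity). This record does not say how either quantity is computed, so the following Python sketch is only an illustration under stated assumptions: PID is taken as the share of identical residues among columns where both sequences have a residue, averaged over all sequence pairs, and the TC score is taken as the fraction of reference columns reproduced exactly in the test alignment. The function names and the toy alignments are hypothetical and are not taken from MLProbs or DLPAlign.

```python
from itertools import combinations

GAP = "-"


def percent_identity(row_a, row_b):
    """PID of two aligned rows (assumed definition): identical residues
    divided by the number of columns where both rows have a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    paired = [(x, y) for x, y in zip(row_a, row_b) if x != GAP and y != GAP]
    if not paired:
        return 0.0
    identical = sum(x == y for x, y in paired)
    return 100.0 * identical / len(paired)


def mean_pid(alignment):
    """Average pairwise PID over all pairs of rows in an alignment."""
    if len(alignment) < 2:
        raise ValueError("need at least two sequences")
    pairs = list(combinations(alignment, 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


def columns(alignment):
    """Describe each column by the residue index it holds in every row
    (None where the row has a gap)."""
    counters = [0] * len(alignment)
    result = []
    for column in zip(*alignment):
        key = []
        for row, symbol in enumerate(column):
            if symbol == GAP:
                key.append(None)
            else:
                key.append(counters[row])
                counters[row] += 1
        result.append(tuple(key))
    return result


def tc_score(test, reference):
    """Fraction of reference columns that appear unchanged in the test
    alignment (a simplified, strict variant of the total-column score)."""
    reference_columns = columns(reference)
    test_columns = set(columns(test))
    matched = sum(col in test_columns for col in reference_columns)
    return matched / len(reference_columns)


# Toy data, invented for illustration only.
reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]

print(mean_pid(test))             # average pairwise PID of the test alignment
print(tc_score(test, reference))  # share of reference columns recovered
```

Benchmark suites often restrict scoring to annotated core columns and may define PID differently; this sketch scores every column and makes no such restriction.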

DC Field	Value	Language
dc.contributor.advisor	Ting, HF	-
dc.contributor.author	Kuang, Mengmeng	-
dc.contributor.author	匡盟盟	-
dc.date.accessioned	2020-10-01T04:31:57Z	-
dc.date.available	2020-10-01T04:31:57Z	-
dc.date.issued	2020	-
dc.identifier.citation	Kuang, M. [匡盟盟]. (2020). Data-centric approaches for better multiple sequence alignment. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/287514	-
dc.description.abstract	In this thesis, we investigated data-centric approaches to the problem of constructing Multiple Sequence Alignments (MSAs). Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach uses models trained on existing data to guide the construction. In our first study, we identified a simple classifier that helps choose the best alignment tool. To correct errors in the initial alignment, we then added a post-processing step that realigns selected regions, and we used a classifier to select an appropriate realignment strategy for each family. In our second study, we examined how deep-learning methods can be incorporated into the underlying steps of the progressive alignment method. To improve the accuracy of that method, we first determined which part had the greatest potential for improvement and then trained a decision-making model for that part to guide the MSA construction process. We released two complete new MSA tools based on the two studies: MLProbs from the first study and DLPAlign from the second. We compared them with about 10 other popular MSA tools on several commonly used empirical benchmarks. The results showed that both tools improved MSA accuracy to some extent on all tests. Furthermore, on low-similarity protein families, our methods gave unexpectedly good results: MLProbs achieved a 2.9% TC-score improvement on families with PID <= 50%, while DLPAlign achieved a 2.8% TC-score improvement on families with PID <= 30%. Moreover, both new MSA methods obtained good results in real-life applications.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Sequence alignment (Bioinformatics)	-
dc.title	Data-centric approaches for better multiple sequence alignment	-
dc.type	PG_Thesis	-
dc.description.thesisname	Master of Philosophy	-
dc.description.thesislevel	Master	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2020	-
dc.identifier.mmsid	991044284999003414	-
