File Download
Supplementary
-
Citations:
- Appears in Collections:
postgraduate thesis: Aligning multiple sequences adaptively
Title | Aligning multiple sequences adaptively |
---|---|
Authors | |
Advisors | Advisor(s):Ting, HF |
Issue Date | 2014 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Ye, Y. [叶永滔]. (2014). Aligning multiple sequences adaptively. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5317071 |
Abstract | With the rapid development of genome sequencing, an ever-increasing number of molecular biology analyses rely on the construction of an accurate multiple sequence alignment (MSA), such as motifs detection, phylogeny inference and structure prediction. Although many methods have been developed during the last two decades, most of them may perform poorly on some types of inputs, in particular when families of sequences fall below thirty percent similarity. Therefore, this thesis introduced two different effective approaches to improve the overall quality of multiple sequence alignment.
First, by considering the similarity of the input sequences, we proposed an adaptive approach to compute better substitution matrices for each pair of sequences, and then apply the progressive alignment method to align them. For example, for inputs with high similarity, we consider the whole sequences and align them with global pair-Hidden Markov model, while for those with moderate low similarity, we may ignore the ank regions and use some local pair-Hidden Markov models to align them. To test the effectiveness of this approach, we have implemented a multiple sequence alignment tool called GLProbs and compared its performance with one dozen leading tools on three benchmark alignment databases, and GLProbs' alignments have the best scores in almost all testings. We have also evaluated the practicability of the alignments of GLProbs by applying the tool to three biological applications, namely phylogenetic tree reconstruction, protein secondary structure prediction and the detection of high risk members for cervical cancer in the HPV-E6 family, and the results are very encouraging.
Second, based on our previous study, we proposed another new tool PnpProbs, which constructs better multiple sequence alignments by better handling of guide trees. It classifies input sequences into two types: normally related sequences and distantly related sequences. For normally related sequences, it uses an adaptive approach to construct the guide tree, and based on this guide tree, aligns the sequences progressively. To be more precise, it first estimates the input's discrepancy by computing the standard deviation of their percent identities, and based on this estimate, it chooses the best method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree; instead it uses the non-progressive sequence annealing method to construct the multiple sequence alignment. By combining the strength of the progressive and non-progressive methods, and with a better way to construct the guide tree, PnpProbs improves the quality of multiple sequence alignments significantly for not only general input sequences, but also those very distantly related.
With those encouraging empirical results, our developed software tools have been appreciated by the community gradually. For example, GLProbs has been invited and incorporated into the JAva Bioinformatics Analysis Web Services system (JABAWS). |
Degree | Master of Philosophy |
Subject | Sequence alignment (Bioinformatics) |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/206465 |
HKU Library Item ID | b5317071 |
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Ting, HF | - |
dc.contributor.author | Ye, Yongtao | - |
dc.contributor.author | 叶永滔 | - |
dc.date.accessioned | 2014-10-31T23:15:57Z | - |
dc.date.available | 2014-10-31T23:15:57Z | - |
dc.date.issued | 2014 | - |
dc.identifier.citation | Ye, Y. [叶永滔]. (2014). Aligning multiple sequences adaptively. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5317071 | - |
dc.identifier.uri | http://hdl.handle.net/10722/206465 | - |
dc.description.abstract | With the rapid development of genome sequencing, an ever-increasing number of molecular biology analyses rely on the construction of an accurate multiple sequence alignment (MSA), such as motifs detection, phylogeny inference and structure prediction. Although many methods have been developed during the last two decades, most of them may perform poorly on some types of inputs, in particular when families of sequences fall below thirty percent similarity. Therefore, this thesis introduced two different effective approaches to improve the overall quality of multiple sequence alignment. First, by considering the similarity of the input sequences, we proposed an adaptive approach to compute better substitution matrices for each pair of sequences, and then apply the progressive alignment method to align them. For example, for inputs with high similarity, we consider the whole sequences and align them with global pair-Hidden Markov model, while for those with moderate low similarity, we may ignore the ank regions and use some local pair-Hidden Markov models to align them. To test the effectiveness of this approach, we have implemented a multiple sequence alignment tool called GLProbs and compared its performance with one dozen leading tools on three benchmark alignment databases, and GLProbs' alignments have the best scores in almost all testings. We have also evaluated the practicability of the alignments of GLProbs by applying the tool to three biological applications, namely phylogenetic tree reconstruction, protein secondary structure prediction and the detection of high risk members for cervical cancer in the HPV-E6 family, and the results are very encouraging. Second, based on our previous study, we proposed another new tool PnpProbs, which constructs better multiple sequence alignments by better handling of guide trees. It classifies input sequences into two types: normally related sequences and distantly related sequences. For normally related sequences, it uses an adaptive approach to construct the guide tree, and based on this guide tree, aligns the sequences progressively. To be more precise, it first estimates the input's discrepancy by computing the standard deviation of their percent identities, and based on this estimate, it chooses the best method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree; instead it uses the non-progressive sequence annealing method to construct the multiple sequence alignment. By combining the strength of the progressive and non-progressive methods, and with a better way to construct the guide tree, PnpProbs improves the quality of multiple sequence alignments significantly for not only general input sequences, but also those very distantly related. With those encouraging empirical results, our developed software tools have been appreciated by the community gradually. For example, GLProbs has been invited and incorporated into the JAva Bioinformatics Analysis Web Services system (JABAWS). | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Sequence alignment (Bioinformatics) | - |
dc.title | Aligning multiple sequences adaptively | - |
dc.type | PG_Thesis | - |
dc.identifier.hkul | b5317071 | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_b5317071 | - |
dc.identifier.mmsid | 991039908669703414 | - |