postgraduate thesis: Time-efficient and highly sensitive solutions for large-scale sequence alignments
Title | Time-efficient and highly sensitive solutions for large-scale sequence alignments |
---|---|
Authors | Mai, Huijun (麥慧君) |
Advisors | Ting, HF |
Issue Date | 2018 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Mai, H. [麥慧君]. (2018). Time-efficient and highly sensitive solutions for large-scale sequence alignments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Sequence alignment is a widely used and effective method for exploring the functional characteristics of sequences. As the rapid advancement of next-generation sequencing (NGS) technologies makes more and more biological sequences available for study, analysing the newly generated genomic data efficiently has become a challenge. This thesis introduces software solutions for aligning large amounts of DNA sequences efficiently and sensitively.
The first software tool we designed and implemented is LASTM, which addresses the problem of whole-genome alignment. This problem often involves comparing two long genomes with billions of base pairs. Before LASTM, existing tools made the comparison of two large genomes possible and efficient at the cost of sensitivity, and they become very slow when extra sensitivity is needed. LASTM was developed to handle this problem: it proposes a simple but effective method to improve the sensitivity of existing whole-genome alignment software without spending too much extra running time.
Our second software tool is AC-DIAMOND, which performs rapid and sensitive DNA-protein alignment. The computational bottlenecks of previous DNA-protein aligners limited their application to the alignment of large-scale datasets against a protein database. We implemented a time-efficient aligner, AC-DIAMOND, to tackle these bottlenecks. The first version, AC-DIAMOND v0, compresses the reference indexes to reduce the time spent reloading the same reference sequences and reconstructing the same reference indexes. Moreover, AC-DIAMOND v0 exploits SIMD technologies to accelerate the time-consuming dynamic programming process. When aligning large amounts of long reads or assembled contigs to protein databases, AC-DIAMOND v0 gained a 4-fold speed-up. Recently, AC-DIAMOND v0 has been applied in the pathogen detection pipeline MegaPath to solve real clinical problems in a sensitive and fast manner.
To further improve AC-DIAMOND v0, we designed and implemented AC-DIAMOND v1. By making use of an even more compressed reference index and adopting an adaptive seed-length search, AC-DIAMOND v1 provides a more effective method to locate seeds between the dataset and the protein database. In addition, AC-DIAMOND v1 uses a better SIMD implementation and packing strategy to parallelize the dynamic programming process. With these improvements, AC-DIAMOND v1 saved nearly 40% of the running time of the previous version v0 and achieved a 7-fold speed-up compared with DIAMOND. Most importantly, AC-DIAMOND did not sacrifice sensitivity: it provided sensitivity similar to that of the pioneering aligner DIAMOND. |
Degree | Doctor of Philosophy |
Subject | Sequence alignment (Bioinformatics) |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/261462 |
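
The speed-up figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that the 4-fold speed-up of AC-DIAMOND v0 is measured against DIAMOND (the abstract does not name the baseline) and that "nearly 40%" can be approximated as exactly 0.40. The function name `combined_speedup` is hypothetical and introduced only for illustration.

```python
def combined_speedup(base_speedup: float, time_saved_fraction: float) -> float:
    """Speed-up over the original baseline after a further fractional
    reduction in running time is applied to an already faster version."""
    if not 0.0 <= time_saved_fraction < 1.0:
        raise ValueError("time_saved_fraction must be in [0, 1)")
    # New running time = (baseline time / base_speedup) * (1 - saved fraction),
    # so the speed-up over the baseline is base_speedup / (1 - saved fraction).
    return base_speedup / (1.0 - time_saved_fraction)


# Figures from the abstract; the DIAMOND baseline for v0 is an assumption.
v0_speedup = 4.0
v1_time_saved = 0.40

v1_speedup = combined_speedup(v0_speedup, v1_time_saved)
print(f"Implied v1 speed-up over the baseline: {v1_speedup:.2f}x")
```

Under these assumptions the implied figure is about 6.7-fold, which is in line with the 7-fold speed-up reported for AC-DIAMOND v1.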
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Ting, HF | - |
dc.contributor.author | Mai, Huijun | - |
dc.contributor.author | 麥慧君 | - |
dc.date.accessioned | 2018-09-20T06:43:46Z | - |
dc.date.available | 2018-09-20T06:43:46Z | - |
dc.date.issued | 2018 | - |
dc.identifier.citation | Mai, H. [麥慧君]. (2018). Time-efficient and highly sensitive solutions for large-scale sequence alignments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/261462 | - |
dc.description.abstract | Sequence alignment is a widely used and effective method for exploring the functional characteristics of sequences. As the rapid advancement of next-generation sequencing (NGS) technologies makes more and more biological sequences available for study, analysing the newly generated genomic data efficiently has become a challenge. This thesis introduces software solutions for aligning large amounts of DNA sequences efficiently and sensitively. The first software tool we designed and implemented is LASTM, which addresses the problem of whole-genome alignment. This problem often involves comparing two long genomes with billions of base pairs. Before LASTM, existing tools made the comparison of two large genomes possible and efficient at the cost of sensitivity, and they become very slow when extra sensitivity is needed. LASTM was developed to handle this problem: it proposes a simple but effective method to improve the sensitivity of existing whole-genome alignment software without spending too much extra running time. Our second software tool is AC-DIAMOND, which performs rapid and sensitive DNA-protein alignment. The computational bottlenecks of previous DNA-protein aligners limited their application to the alignment of large-scale datasets against a protein database. We implemented a time-efficient aligner, AC-DIAMOND, to tackle these bottlenecks. The first version, AC-DIAMOND v0, compresses the reference indexes to reduce the time spent reloading the same reference sequences and reconstructing the same reference indexes. Moreover, AC-DIAMOND v0 exploits SIMD technologies to accelerate the time-consuming dynamic programming process. When aligning large amounts of long reads or assembled contigs to protein databases, AC-DIAMOND v0 gained a 4-fold speed-up. Recently, AC-DIAMOND v0 has been applied in the pathogen detection pipeline MegaPath to solve real clinical problems in a sensitive and fast manner. To further improve AC-DIAMOND v0, we designed and implemented AC-DIAMOND v1. By making use of an even more compressed reference index and adopting an adaptive seed-length search, AC-DIAMOND v1 provides a more effective method to locate seeds between the dataset and the protein database. In addition, AC-DIAMOND v1 uses a better SIMD implementation and packing strategy to parallelize the dynamic programming process. With these improvements, AC-DIAMOND v1 saved nearly 40% of the running time of the previous version v0 and achieved a 7-fold speed-up compared with DIAMOND. Most importantly, AC-DIAMOND did not sacrifice sensitivity: it provided sensitivity similar to that of the pioneering aligner DIAMOND. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Sequence alignment (Bioinformatics) | - |
dc.title | Time-efficient and highly sensitive solutions for large-scale sequence alignments | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_991044040580903414 | - |
dc.date.hkucongregation | 2018 | - |
dc.identifier.mmsid | 991044040580903414 | - |
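
The abstract refers to a time-consuming dynamic programming process without giving its recurrence. As a rough illustration of the kind of computation involved, the sketch below assumes a standard Smith-Waterman local alignment score with a linear gap penalty and simple match/mismatch scores. It is plain Python with no SIMD, it is not taken from AC-DIAMOND, and the function name and scoring parameters are hypothetical. A real protein aligner would typically use a substitution matrix rather than a single match/mismatch pair.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for ca in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Two sequences sharing the local match "ACG".
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell depends only on its left, upper, and upper-left neighbours, so the table costs time proportional to the product of the two sequence lengths. This regular per-cell update is the part that vectorized implementations generally target.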