Time-efficient and highly sensitive solutions for large-scale sequence alignments

Mai, Huijun; 麥慧君

DOI: 10.5353/th_991044040580903414

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Time-efficient and highly sensitive solutions for large-scale sequence alignments

Title	Time-efficient and highly sensitive solutions for large-scale sequence alignments
Authors	Mai, Huijun 麥慧君
Advisors	Ting, HF
Issue Date	2018
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Mai, H. [麥慧君]. (2018). Time-efficient and highly sensitive solutions for large-scale sequence alignments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Sequence alignment is a widely used and effective method for exploring the functional characteristics of sequences. As the rapid advancement of next-generation sequencing (NGS) technologies makes more and more biological sequences available for study, analysing the newly generated genomic data efficiently has become a challenge. This thesis introduces software solutions for aligning large numbers of DNA sequences efficiently and sensitively. The first tool we designed and implemented is LASTM, which addresses the problem of whole-genome alignment. This problem often involves comparing two long genomes with billions of base pairs. Before LASTM, existing tools had made the comparison of two large genomes possible and efficient, but at the cost of sensitivity, and they become very slow when extra sensitivity is needed. LASTM was developed to handle this problem; it uses a simple but effective method to improve the sensitivity of existing whole-genome alignment software without much extra running time. Our second tool is AC-DIAMOND, a time-efficient aligner for rapid and sensitive DNA-protein alignment. The computational bottlenecks of previous DNA-protein aligners limited their application to aligning large-scale datasets against a protein database, and AC-DIAMOND was implemented to tackle these bottlenecks. The first version, AC-DIAMOND v0, compresses the reference indexes to reduce the time spent reloading the same reference sequences and reconstructing the same reference indexes. Moreover, AC-DIAMOND v0 exploits SIMD technologies to accelerate the time-consuming dynamic programming process. When aligning large numbers of long reads or assembled contigs to protein databases, AC-DIAMOND v0 achieved a 4-fold speed-up. AC-DIAMOND v0 has recently been applied in the pathogen detection pipeline MegaPath to solve real clinical problems quickly and sensitively.
To further improve AC-DIAMOND v0, we designed and implemented AC-DIAMOND v1. By using an even more compressed reference index and adopting an adaptive seed-length search, AC-DIAMOND v1 provides a more effective method for locating seeds between the dataset and the protein database. In addition, AC-DIAMOND v1 uses a better SIMD implementation and packing strategy to parallelize the dynamic programming process. With these improvements, AC-DIAMOND v1 cut running time by nearly 40% relative to v0 and achieved a 7-fold speed-up over DIAMOND. Most importantly, AC-DIAMOND did not sacrifice sensitivity: it provides sensitivity similar to that of the pioneering aligner DIAMOND.
Degree	Doctor of Philosophy
Subject	Sequence alignment (Bioinformatics)
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/261462

DC Field	Value	Language
dc.contributor.advisor	Ting, HF	-
dc.contributor.author	Mai, Huijun	-
dc.contributor.author	麥慧君	-
dc.date.accessioned	2018-09-20T06:43:46Z	-
dc.date.available	2018-09-20T06:43:46Z	-
dc.date.issued	2018	-
dc.identifier.citation	Mai, H. [麥慧君]. (2018). Time-efficient and highly sensitive solutions for large-scale sequence alignments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/261462	-
dc.description.abstract	Sequence alignment is a widely used and effective method for exploring the functional characteristics of sequences. As the rapid advancement of next-generation sequencing (NGS) technologies makes more and more biological sequences available for study, analysing the newly generated genomic data efficiently has become a challenge. This thesis introduces software solutions for aligning large numbers of DNA sequences efficiently and sensitively. The first tool we designed and implemented is LASTM, which addresses the problem of whole-genome alignment. This problem often involves comparing two long genomes with billions of base pairs. Before LASTM, existing tools had made the comparison of two large genomes possible and efficient, but at the cost of sensitivity, and they become very slow when extra sensitivity is needed. LASTM was developed to handle this problem; it uses a simple but effective method to improve the sensitivity of existing whole-genome alignment software without much extra running time. Our second tool is AC-DIAMOND, a time-efficient aligner for rapid and sensitive DNA-protein alignment. The computational bottlenecks of previous DNA-protein aligners limited their application to aligning large-scale datasets against a protein database, and AC-DIAMOND was implemented to tackle these bottlenecks. The first version, AC-DIAMOND v0, compresses the reference indexes to reduce the time spent reloading the same reference sequences and reconstructing the same reference indexes. Moreover, AC-DIAMOND v0 exploits SIMD technologies to accelerate the time-consuming dynamic programming process. When aligning large numbers of long reads or assembled contigs to protein databases, AC-DIAMOND v0 achieved a 4-fold speed-up.
AC-DIAMOND v0 has recently been applied in the pathogen detection pipeline MegaPath to solve real clinical problems quickly and sensitively. To further improve AC-DIAMOND v0, we designed and implemented AC-DIAMOND v1. By using an even more compressed reference index and adopting an adaptive seed-length search, AC-DIAMOND v1 provides a more effective method for locating seeds between the dataset and the protein database. In addition, AC-DIAMOND v1 uses a better SIMD implementation and packing strategy to parallelize the dynamic programming process. With these improvements, AC-DIAMOND v1 cut running time by nearly 40% relative to v0 and achieved a 7-fold speed-up over DIAMOND. Most importantly, AC-DIAMOND did not sacrifice sensitivity: it provides sensitivity similar to that of the pioneering aligner DIAMOND.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Sequence alignment (Bioinformatics)	-
dc.title	Time-efficient and highly sensitive solutions for large-scale sequence alignments	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_991044040580903414	-
dc.date.hkucongregation	2018	-
dc.identifier.mmsid	991044040580903414	-
