Efficient solutions for bioinformatics applications using GPUs

Liu, Chi-man; 廖志敏

File Download

FullText.pdf

Links for fulltext

(May Require Subscription)

DOI: 10.5353/th_b5481889

Supplementary

Citations:
Appears in Collections:
- Computer Science: Theses
- HKU Theses Online

postgraduate thesis: Efficient solutions for bioinformatics applications using GPUs

Title	Efficient solutions for bioinformatics applications using GPUs
Authors	Liu, Chi-man 廖志敏
Issue Date	2015
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Liu, C. [廖志敏]. (2015). Efficient solutions for bioinformatics applications using GPUs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5481889
Abstract	Over the past few years, DNA sequencing technology has been advancing at such a fast pace that computer hardware and software can hardly meet the ever-increasing demand for sequence analysis. A natural approach to boost analysis efficiency is parallelization, which divides the problem into smaller ones that are to be solved simultaneously on multiple execution units. Common architectures such as multi-core CPUs and clusters can increase the throughput to some extent, but the hardware setup and maintenance costs are prohibitive. Fortunately, the newly emerged general-purpose GPU programming paradigm gives us a low-cost alternative for parallelization. This thesis presents GPU-accelerated algorithms for several problems in bioinformatics, along with implementations to demonstrate their power in handling enormous totally different limitations and optimization techniques than the CPU. The first tool presented is SOAP3-dp, which is a DNA short-read aligner highly optimized for speed. Prior to SOAP3-DP, the fastest short-read aligner was its predecessor SOAP2, which was capable of aligning 1 million 100-bp reads in 5 minutes. SOAP3-dp beats this record by aligning the same volume in only 10 seconds. The key to unlocking this unprecedented speed is the revamped BWT engine underlying SOAP3-dp. All data structures and associated operations have been tailor made for the GPU to achieve optimized performance. Experiments show that SOAP3-dp not only excels in speed, but also outperforms other aligners in both alignment sensitivity and accuracy. The next tools are for constructing data structures, namely Burrows-Wheeler transform (BWT) and de Bruijn graphs (DBGs), to facilitate genome assembly of short reads, especially large metagenomics data. The BWT index for a set of short reads has recently found its use in string-graph assemblers [44], as it provides a succinct way of representing huge string graphs which would otherwise exceed the main memory limit. Constructing the BWT index for a million reads is by itself not an easy task, let alone optimize for the GPU. Another class of assemblers, the DBG-based assemblers, also faces the same problem. This thesis presents construction algorithms for both the BWT and DBGs in a succinct form. In our experiments, we constructed the succinct DBG for a metagenomics data set with over 200 gigabases in 3 hours, and the resulting DBG only consumed 31.2 GB of memory. We also constructed the BWT index for 10 million 100-bp reads in 40 minutes using 4 quad-core machines. Lastly, we introduce a SNP detection tool, iSNPcall, which detects SNPs from a set of reads. Given a set of user-supplied annotated SNPs, iSNPcall focuses only on alignments covering these SNPs, which greatly accelerates the detection of SNPs at the prescribed loci. The annotated SNPs also helps us distinguish sequencing errors from authentic SNPs alleles easily. This is in contrast to the traditional de novo method which aligns reads onto the reference genome and then filters inauthentic mismatches according to some probabilities. After comparing on several applications, iSNPcall was found to give a higher accuracy than the de novo method, especially for samples with low coverage.
Degree	Doctor of Philosophy
Subject	Graphics processing units Bioinformatics
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/211138
HKU Library Item ID	b5481889

DC Field	Value	Language
dc.contributor.author	Liu, Chi-man	-
dc.contributor.author	廖志敏	-
dc.date.accessioned	2015-07-07T23:10:46Z	-
dc.date.available	2015-07-07T23:10:46Z	-
dc.date.issued	2015	-
dc.identifier.citation	Liu, C. [廖志敏]. (2015). Efficient solutions for bioinformatics applications using GPUs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5481889	-
dc.identifier.uri	http://hdl.handle.net/10722/211138	-
dc.description.abstract	Over the past few years, DNA sequencing technology has been advancing at such a fast pace that computer hardware and software can hardly meet the ever-increasing demand for sequence analysis. A natural approach to boost analysis efficiency is parallelization, which divides the problem into smaller ones that are to be solved simultaneously on multiple execution units. Common architectures such as multi-core CPUs and clusters can increase the throughput to some extent, but the hardware setup and maintenance costs are prohibitive. Fortunately, the newly emerged general-purpose GPU programming paradigm gives us a low-cost alternative for parallelization. This thesis presents GPU-accelerated algorithms for several problems in bioinformatics, along with implementations to demonstrate their power in handling enormous totally different limitations and optimization techniques than the CPU. The first tool presented is SOAP3-dp, which is a DNA short-read aligner highly optimized for speed. Prior to SOAP3-DP, the fastest short-read aligner was its predecessor SOAP2, which was capable of aligning 1 million 100-bp reads in 5 minutes. SOAP3-dp beats this record by aligning the same volume in only 10 seconds. The key to unlocking this unprecedented speed is the revamped BWT engine underlying SOAP3-dp. All data structures and associated operations have been tailor made for the GPU to achieve optimized performance. Experiments show that SOAP3-dp not only excels in speed, but also outperforms other aligners in both alignment sensitivity and accuracy. The next tools are for constructing data structures, namely Burrows-Wheeler transform (BWT) and de Bruijn graphs (DBGs), to facilitate genome assembly of short reads, especially large metagenomics data. The BWT index for a set of short reads has recently found its use in string-graph assemblers [44], as it provides a succinct way of representing huge string graphs which would otherwise exceed the main memory limit. Constructing the BWT index for a million reads is by itself not an easy task, let alone optimize for the GPU. Another class of assemblers, the DBG-based assemblers, also faces the same problem. This thesis presents construction algorithms for both the BWT and DBGs in a succinct form. In our experiments, we constructed the succinct DBG for a metagenomics data set with over 200 gigabases in 3 hours, and the resulting DBG only consumed 31.2 GB of memory. We also constructed the BWT index for 10 million 100-bp reads in 40 minutes using 4 quad-core machines. Lastly, we introduce a SNP detection tool, iSNPcall, which detects SNPs from a set of reads. Given a set of user-supplied annotated SNPs, iSNPcall focuses only on alignments covering these SNPs, which greatly accelerates the detection of SNPs at the prescribed loci. The annotated SNPs also helps us distinguish sequencing errors from authentic SNPs alleles easily. This is in contrast to the traditional de novo method which aligns reads onto the reference genome and then filters inauthentic mismatches according to some probabilities. After comparing on several applications, iSNPcall was found to give a higher accuracy than the de novo method, especially for samples with low coverage.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Graphics processing units	-
dc.subject.lcsh	Bioinformatics	-
dc.title	Efficient solutions for bioinformatics applications using GPUs	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5481889	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5481889	-
dc.identifier.mmsid	991005693139703414	-

File Download

Links for fulltext

(May Require Subscription)

Supplementary

postgraduate thesis: Efficient solutions for bioinformatics applications using GPUs

Export via OAI-PMH Interface in XML Formats

OR

Export to Other Non-XML Formats