Efficient analysis solution for DNA short-read sequencing

Law, Wai-chun; 羅維進

DOI: 10.5353/th_b5801690

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Efficient analysis solution for DNA short-read sequencing

Title	Efficient analysis solution for DNA short-read sequencing
Authors	Law, Wai-chun 羅維進
Issue Date	2016
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Law, W. [羅維進]. (2016). Efficient analysis solution for DNA short-read sequencing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	In recent years, demand for DNA sequencing analysis has grown with advances in DNA sequencing technologies, exceeding the capacity of high-end computer servers. This thesis presents integrated software solutions for popular DNA sequencing analyses, along with implementations and experiments on real data that demonstrate their strengths over conventional solutions. The first software tool presented is BALSA, which integrates the paired-end short-read aligner SOAP3-dp with a newly designed secondary analysis. BALSA finishes 30x Whole-Genome Analysis (WGA) within 6 hours, whereas the well-known BWA+GATK pipeline takes about 20 hours for the same analysis. BALSA’s efficiency is rooted in its fast alignment algorithm and an integrated design that significantly reduces the time spent on file I/O. More importantly, experiments show that the variant-calling accuracy and sensitivity of BALSA are competitive with those of existing solutions. The second tool presented is BALSA-Amplicon, which is designed for amplicon sequencing analysis. Unlike WGA data, amplicon sequencing data come with artificial primers, which contaminate the analysis. A common fix is to trim the beginning of each read, but this also removes useful data that helps to map the read correctly. BALSA-Amplicon instead aligns each read with its primer attached and trims the primer only when updating the in-memory data structure that holds alignment information. The sequencing depth of amplicon data can also be several thousand times that of WGA data, so the data structure has been modified to support high sequencing depth without degrading performance. Experiments show that BALSA-Amplicon takes 20 minutes to call variants from 3 million 275 bp amplicon short-read pairs. Thirdly, we introduce the short-read aligner SOAP4, which targets short-read pairs with read lengths of 150 bp or more (the current standard of high-throughput sequencers like HiSeq 10X). Unlike 100 bp reads, these longer reads generally contain more than 2 mismatches, so SOAP3-dp cannot align them quickly using the BWT index alone. SOAP4 relies heavily on the seed-and-extend strategy. Experiments with real 250 bp reads show that SOAP4 is 8% faster than SOAP3-dp and that its sensitivity is 95.83%, compared with 85.32% for SOAP3-dp. Experiments with simulated data show that SOAP4 gives accuracy competitive with that of SOAP3-dp. Lastly, we introduce ELSA, a CPU version of BALSA. ELSA tries to compensate for the absence of a GPU card (which typically contains hundreds to a few thousand cores) with the multi-core CPUs of a single computing node (typically 2x12 cores). Although the GPU is popular in high-performance computing, it is costly and requires special maintenance, especially of its evolving software environment. ELSA is intended as a cost-effective solution for secondary analysis.
Degree	Master of Philosophy
Subject	Nucleotide sequence - Methodology
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/235921
HKU Library Item ID	b5801690

DC Field	Value	Language
dc.contributor.author	Law, Wai-chun	-
dc.contributor.author	羅維進	-
dc.date.accessioned	2016-11-09T23:27:03Z	-
dc.date.available	2016-11-09T23:27:03Z	-
dc.date.issued	2016	-
dc.identifier.citation	Law, W. [羅維進]. (2016). Efficient analysis solution for DNA short-read sequencing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/235921	-
dc.description.abstract	In recent years, demand for DNA sequencing analysis has grown with advances in DNA sequencing technologies, exceeding the capacity of high-end computer servers. This thesis presents integrated software solutions for popular DNA sequencing analyses, along with implementations and experiments on real data that demonstrate their strengths over conventional solutions. The first software tool presented is BALSA, which integrates the paired-end short-read aligner SOAP3-dp with a newly designed secondary analysis. BALSA finishes 30x Whole-Genome Analysis (WGA) within 6 hours, whereas the well-known BWA+GATK pipeline takes about 20 hours for the same analysis. BALSA’s efficiency is rooted in its fast alignment algorithm and an integrated design that significantly reduces the time spent on file I/O. More importantly, experiments show that the variant-calling accuracy and sensitivity of BALSA are competitive with those of existing solutions. The second tool presented is BALSA-Amplicon, which is designed for amplicon sequencing analysis. Unlike WGA data, amplicon sequencing data come with artificial primers, which contaminate the analysis. A common fix is to trim the beginning of each read, but this also removes useful data that helps to map the read correctly. BALSA-Amplicon instead aligns each read with its primer attached and trims the primer only when updating the in-memory data structure that holds alignment information. The sequencing depth of amplicon data can also be several thousand times that of WGA data, so the data structure has been modified to support high sequencing depth without degrading performance. Experiments show that BALSA-Amplicon takes 20 minutes to call variants from 3 million 275 bp amplicon short-read pairs. Thirdly, we introduce the short-read aligner SOAP4, which targets short-read pairs with read lengths of 150 bp or more (the current standard of high-throughput sequencers like HiSeq 10X). Unlike 100 bp reads, these longer reads generally contain more than 2 mismatches, so SOAP3-dp cannot align them quickly using the BWT index alone. SOAP4 relies heavily on the seed-and-extend strategy. Experiments with real 250 bp reads show that SOAP4 is 8% faster than SOAP3-dp and that its sensitivity is 95.83%, compared with 85.32% for SOAP3-dp. Experiments with simulated data show that SOAP4 gives accuracy competitive with that of SOAP3-dp. Lastly, we introduce ELSA, a CPU version of BALSA. ELSA tries to compensate for the absence of a GPU card (which typically contains hundreds to a few thousand cores) with the multi-core CPUs of a single computing node (typically 2x12 cores). Although the GPU is popular in high-performance computing, it is costly and requires special maintenance, especially of its evolving software environment. ELSA is intended as a cost-effective solution for secondary analysis.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Nucleotide sequence - Methodology	-
dc.title	Efficient analysis solution for DNA short-read sequencing	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5801690	-
dc.description.thesisname	Master of Philosophy	-
dc.description.thesislevel	Master	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5801690	-
dc.identifier.mmsid	991020816769703414	-
