
postgraduate thesis: Efficient analysis solution for DNA short-read sequencing

Title: Efficient analysis solution for DNA short-read sequencing
Authors: Law, Wai-chun (羅維進)
Issue Date: 2016
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Law, W. [羅維進]. (2016). Efficient analysis solution for DNA short-read sequencing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: In recent years, advances in DNA sequencing technologies have boosted the demand for DNA sequencing analysis beyond the capacity of high-end computer servers. This thesis presents integrated software solutions for popular DNA sequencing analyses, along with implementations and experiments on real data that demonstrate the strengths of these solutions over conventional ones. The first software tool presented is BALSA, which integrates the paired-end DNA short-read aligner SOAP3-dp with a newly designed secondary analysis. BALSA finishes a 30x Whole-Genome Analysis (WGA) within 6 hours, whereas the well-known BWA+GATK pipeline takes about 20 hours for the same analysis. BALSA’s efficiency is rooted in its fast alignment algorithm and an integrated design that significantly reduces the time spent on file I/O. More importantly, experiments show that the variant-calling accuracy and sensitivity of BALSA are competitive with those of existing solutions. The second tool presented is BALSA-Amplicon, which is designed for amplicon sequencing analysis. Unlike WGA data, amplicon sequencing data come with artificial primers, which contaminate the analysis. A common fix is to trim the primers from the beginning of the reads, but this also removes useful data that helps to map the reads correctly. BALSA-Amplicon instead aligns each read with its primer attached and trims the primer only when updating the in-memory data structure that holds the alignment information. The sequencing depth of amplicon data can also be several thousand times that of WGA data, so the data structure has been modified to support high sequencing depth without degrading performance. Experiments show that BALSA-Amplicon takes 20 minutes to call variants from 3 million 275bp amplicon short-read pairs. Thirdly, we introduce the short-read aligner SOAP4, which targets short-read pairs with read lengths of 150bp or more (the current standard of high-throughput sequencers like HiSeq 10X). Unlike 100bp reads, such reads generally contain more than 2 mismatches, and SOAP3-dp cannot align them quickly using the BWT index alone. SOAP4 therefore relies heavily on the seed-and-extend strategy. Experiments on real 250bp data show that SOAP4 is 8% faster than SOAP3-dp and that its sensitivity is 95.83%, compared with 85.32% for SOAP3-dp. Experiments on simulated data show that SOAP4’s accuracy is competitive with that of SOAP3-dp. Lastly, we introduce ELSA, a CPU version of BALSA. ELSA tries to make up for the absence of a GPU card (which typically contains hundreds to a few thousand cores) by using the multi-core CPUs in a single computing node (typically 2x12 cores). Although GPUs are popular in high-performance computing, they are costly and require special maintenance, especially of their evolving software environment. ELSA is intended as a cost-effective solution for secondary analysis.
Degree: Master of Philosophy
Subject: Nucleotide sequence - Methodology
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/235921
HKU Library Item ID: b5801690
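
The seed-and-extend strategy named in the abstract can be illustrated with a toy sketch. This is not SOAP4's implementation: the record gives no algorithmic detail beyond the strategy's name, and SOAP3-dp is described as using a BWT index. The sketch below swaps in a plain k-mer hash index and an ungapped, mismatch-counting extension; the function names (build_kmer_index, seed_and_extend) and the parameters (k, max_mismatches) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the positions where it starts.

    Assumption: a hash index stands in for the BWT index a real aligner uses.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches):
    """Return (ref_start, mismatches) for the best ungapped placement, or None."""
    best = None
    tried = set()
    # Seed: cut the read into non-overlapping k-mers and look up exact hits.
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset  # read start implied by this seed hit
            if start in tried or start < 0 or start + len(read) > len(reference):
                continue
            tried.add(start)
            # Extend: compare the whole read against the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCC"
read = "GAATGCCAATAGCTAGGCAT"  # reference[5:25] with three substitutions
index = build_kmer_index(reference, k=5)
print(seed_and_extend(read, reference, index, k=5, max_mismatches=3))  # (5, 3)
```

In this example only one of the four seeds matches the reference exactly, yet that single hit is enough to place the read, which is why seeding tolerates reads carrying several mismatches. A production aligner would additionally handle gaps, paired-end constraints, and base qualities.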

 

DC Field | Value | Language
dc.contributor.author | Law, Wai-chun | -
dc.contributor.author | 羅維進 | -
dc.date.accessioned | 2016-11-09T23:27:03Z | -
dc.date.available | 2016-11-09T23:27:03Z | -
dc.date.issued | 2016 | -
dc.identifier.citation | Law, W. [羅維進]. (2016). Efficient analysis solution for DNA short-read sequencing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/235921 | -
dc.description.abstract | In recent years, advances in DNA sequencing technologies have boosted the demand for DNA sequencing analysis beyond the capacity of high-end computer servers. This thesis presents integrated software solutions for popular DNA sequencing analyses, along with implementations and experiments on real data that demonstrate the strengths of these solutions over conventional ones. The first software tool presented is BALSA, which integrates the paired-end DNA short-read aligner SOAP3-dp with a newly designed secondary analysis. BALSA finishes a 30x Whole-Genome Analysis (WGA) within 6 hours, whereas the well-known BWA+GATK pipeline takes about 20 hours for the same analysis. BALSA’s efficiency is rooted in its fast alignment algorithm and an integrated design that significantly reduces the time spent on file I/O. More importantly, experiments show that the variant-calling accuracy and sensitivity of BALSA are competitive with those of existing solutions. The second tool presented is BALSA-Amplicon, which is designed for amplicon sequencing analysis. Unlike WGA data, amplicon sequencing data come with artificial primers, which contaminate the analysis. A common fix is to trim the primers from the beginning of the reads, but this also removes useful data that helps to map the reads correctly. BALSA-Amplicon instead aligns each read with its primer attached and trims the primer only when updating the in-memory data structure that holds the alignment information. The sequencing depth of amplicon data can also be several thousand times that of WGA data, so the data structure has been modified to support high sequencing depth without degrading performance. Experiments show that BALSA-Amplicon takes 20 minutes to call variants from 3 million 275bp amplicon short-read pairs. Thirdly, we introduce the short-read aligner SOAP4, which targets short-read pairs with read lengths of 150bp or more (the current standard of high-throughput sequencers like HiSeq 10X). Unlike 100bp reads, such reads generally contain more than 2 mismatches, and SOAP3-dp cannot align them quickly using the BWT index alone. SOAP4 therefore relies heavily on the seed-and-extend strategy. Experiments on real 250bp data show that SOAP4 is 8% faster than SOAP3-dp and that its sensitivity is 95.83%, compared with 85.32% for SOAP3-dp. Experiments on simulated data show that SOAP4’s accuracy is competitive with that of SOAP3-dp. Lastly, we introduce ELSA, a CPU version of BALSA. ELSA tries to make up for the absence of a GPU card (which typically contains hundreds to a few thousand cores) by using the multi-core CPUs in a single computing node (typically 2x12 cores). Although GPUs are popular in high-performance computing, they are costly and require special maintenance, especially of their evolving software environment. ELSA is intended as a cost-effective solution for secondary analysis. | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Nucleotide sequence - Methodology | -
dc.title | Efficient analysis solution for DNA short-read sequencing | -
dc.type | PG_Thesis | -
dc.identifier.hkul | b5801690 | -
dc.description.thesisname | Master of Philosophy | -
dc.description.thesislevel | Master | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
