postgraduate thesis: Efficient analysis solution for DNA short-read sequencing
Title | Efficient analysis solution for DNA short-read sequencing |
---|---|
Authors | Law, Wai-chun (羅維進) |
Issue Date | 2016 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Law, W. [羅維進]. (2016). Efficient analysis solution for DNA short-read sequencing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | In recent years, advances in DNA sequencing technologies have boosted the demand for DNA sequencing analysis beyond the capacity of high-end computer servers.
This thesis presents integrated software solutions for popular DNA sequencing analyses, along with implementations and experiments on real data that demonstrate their strengths over conventional solutions.
The first software tool presented is BALSA, which integrates the paired-end DNA short-read aligner SOAP3-dp with a newly designed secondary analysis. BALSA finishes a 30x Whole-Genome Analysis (WGA) within 6 hours, whereas the well-known BWA+GATK pipeline takes about 20 hours for the same analysis. BALSA’s efficiency is rooted in its fast alignment algorithm and an integrated design that significantly reduces the time spent on file I/O. More importantly, experiments show that the variant calling accuracy and sensitivity of BALSA are competitive with existing solutions.
The second tool presented is BALSA-Amplicon, which is designed for amplicon sequencing analysis. Unlike WGA data, amplicon sequencing data come with artificial primers, which contaminate the analysis. A common fix is to trim the primers from the start of the reads, but this also removes useful data that helps to map the reads correctly.
BALSA-Amplicon instead aligns each read with its primer still attached and trims the primer only when updating the in-memory data structure that holds the alignment information. The sequencing depth of amplicon data can also be several thousand times that of WGA data, so the data structure has been modified to support high sequencing depth without degrading performance. Experiments show that BALSA-Amplicon takes 20 minutes to call variants from 3 million 275 bp amplicon short-read pairs.
Thirdly, we introduce SOAP4, a short-read aligner targeted at short-read pairs with read lengths of 150 bp or more (the current standard of high-throughput sequencers like HiSeq 10X). Unlike in 100 bp reads, the number of mismatches along such reads is generally greater than 2, and SOAP3-dp is unable to align them quickly using the BWT index alone. SOAP4 therefore relies heavily on the seed-and-extend strategy. Experiments with real 250 bp data show that SOAP4 is 8% faster than SOAP3-dp and that its sensitivity is 95.83%, compared to 85.32% for SOAP3-dp. Experiments with simulated data show that SOAP4 gives accuracy competitive with SOAP3-dp.
Lastly, we introduce ELSA, a CPU version of BALSA. ELSA tries to replace a GPU card (which typically contains hundreds to a few thousand cores) with the multi-core CPUs of a single computing node (typically 2x12 cores). Although GPUs are popular in high-performance computing, they are costly and require special maintenance, especially of their evolving software environment. ELSA is intended as a cost-effective solution for secondary analysis. |
Degree | Master of Philosophy |
Subject | Nucleotide sequence - Methodology |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/235921 |
HKU Library Item ID | b5801690 |
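
The seed-and-extend strategy named in the abstract can be illustrated with a minimal Python sketch. This is not SOAP4's implementation: it assumes a plain dictionary of k-mers in place of a BWT index and an ungapped, mismatch-counting extension, and the names `build_kmer_index` and `seed_and_extend` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts.

    Assumption: a hash index stands in for the BWT index used by real aligners.
    """
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches):
    """Return (ref_start, mismatches) for the best ungapped hit, or None."""
    best = None
    tried = set()
    # Seeding: cut the read into non-overlapping k-mers and look up exact hits.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for hit in index.get(seed, ()):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in tried:
                continue
            tried.add(start)
            # Extension: compare the full read against the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "ACGTTGCAAGCTTAGGCTAACGGATCCGTA"
index = build_kmer_index(reference, k=4)
# A 16-base read taken from position 6 with three substitutions; only its
# last 4-mer still matches the reference exactly.
read = "CTAGCGTAGACTAACG"
print(seed_and_extend(read, reference, index, k=4, max_mismatches=3))  # (6, 3)
```

A single surviving exact seed is enough to place the read, which is why this style of search tolerates more than two mismatches per read where a purely exact index lookup of the whole read would fail.
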
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Law, Wai-chun | - |
dc.contributor.author | 羅維進 | - |
dc.date.accessioned | 2016-11-09T23:27:03Z | - |
dc.date.available | 2016-11-09T23:27:03Z | - |
dc.date.issued | 2016 | - |
dc.identifier.citation | Law, W. [羅維進]. (2016). Efficient analysis solution for DNA short-read sequencing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/235921 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Nucleotide sequence - Methodology | - |
dc.title | Efficient analysis solution for DNA short-read sequencing | - |
dc.type | PG_Thesis | - |
dc.identifier.hkul | b5801690 | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_b5801690 | - |
dc.identifier.mmsid | 991020816769703414 | - |
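
The abstract's description of amplicon handling (align with the primer attached, trim it only when updating the in-memory alignment data) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the thesis's data structure: the pileup is assumed to be a per-position counter of observed bases, alignments are assumed ungapped, and `add_alignment_to_pileup` is a hypothetical name.

```python
from collections import Counter, defaultdict


def add_alignment_to_pileup(pileup, ref_start, read, primer_len):
    """Record an ungapped alignment, skipping the leading primer bases.

    The read was aligned with its primer attached (ref_start is where the
    primer begins), but only bases after the primer are counted as evidence.
    """
    for i in range(primer_len, len(read)):
        pileup[ref_start + i][read[i]] += 1


# Assumption: position -> Counter of bases observed at that position.
pileup = defaultdict(Counter)
add_alignment_to_pileup(pileup, ref_start=100, read="ACGTTGCA", primer_len=3)
add_alignment_to_pileup(pileup, ref_start=100, read="ACGTAGCA", primer_len=3)

for position in sorted(pileup):
    print(position, dict(pileup[position]))
```

Positions 100 to 102, which are covered only by the primer, never enter the pileup, so the artificial primer bases cannot contribute to a variant call even though they helped anchor the alignment.
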