postgraduate thesis: A fast and accurate model to detect germline SNPs and somatic SNVs with high-throughput sequencing

Title: A fast and accurate model to detect germline SNPs and somatic SNVs with high-throughput sequencing
Authors: Wang, Weixin (王煒欣)
Advisor(s): Wang, JJ; Lam, TW
Issue Date: 2014
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation
Wang, W. [王煒欣]. (2014). A fast and accurate model to detect germline SNPs and somatic SNVs with high-throughput sequencing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5185953
Abstract: The rapid development of high-throughput sequencing technology provides a new opportunity to extend the scale and resolution of genomic research. Calling genetic variants efficiently and accurately at the single-base level, whether germline single nucleotide polymorphisms (SNPs) or somatic single nucleotide variants (SNVs), is the fundamental challenge in sequencing data analysis, because these variants are reported to influence transcriptional regulation, alternative splicing, non-coding RNA regulation and protein coding. Many applications have been developed to tackle this challenge. However, shallow sequencing depth and cellular heterogeneity prevent those tools from attaining satisfactory accuracy, and the huge volume of sequencing data itself makes the process inefficient. In this dissertation, the performance of prevalent read aligners and SNP callers for second-generation sequencing (SGS) is first evaluated. The significantly lower coverage and poorer SNP calling performance of SGS in the regulatory regions of the human genome, attributed to their high GC content, are then investigated. To enhance the capability to call SNPs, especially within lower-depth regions, a fast and accurate SNP detection (FaSD) program that uses a binomial-distribution-based algorithm and a mutation probability is proposed. Comparison with popular software, benchmarked against SNP arrays and high-depth sequencing data, demonstrates that FaSD has the best SNP calling accuracy in terms of genotype concordance rate and AUC. Furthermore, FaSD can finish SNP calling within four hours for 10X human genome SGS data on a standard desktop computer. Lastly, combined with joint genotype likelihoods, an updated version of FaSD is proposed to call cancerous somatic SNVs between paired tumor and normal samples.
With extensive assessments on various types of cancer, it is demonstrated that, whether benchmarked against known somatic SNVs and germline SNPs from databases or against somatic SNVs called from higher-depth data, FaSD-somatic has the best overall performance. Inheriting and improving on FaSD, FaSD-somatic is also the fastest somatic SNV caller among current programs, and can finish calling somatic mutations within 14 hours for 50X paired tumor and normal samples on an ordinary server.
Degree: Doctor of Philosophy
Subject: Chromosome polymorphism; Nucleotide sequence
Dept/Program: Biochemistry
Persistent Identifier: http://hdl.handle.net/10722/197115

 

DC Field | Value | Language
dc.contributor.advisor | Wang, JJ | -
dc.contributor.advisor | Lam, TW | -
dc.contributor.author | Wang, Weixin | -
dc.contributor.author | 王煒欣 | -
dc.date.accessioned | 2014-05-07T23:15:28Z | -
dc.date.available | 2014-05-07T23:15:28Z | -
dc.date.issued | 2014 | -
dc.identifier.citation | Wang, W. [王煒欣]. (2014). A fast and accurate model to detect germline SNPs and somatic SNVs with high-throughput sequencing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5185953 | -
dc.identifier.uri | http://hdl.handle.net/10722/197115 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | Creative Commons: Attribution 3.0 Hong Kong License | -
dc.subject.lcsh | Chromosome polymorphism | -
dc.subject.lcsh | Nucleotide sequence | -
dc.title | A fast and accurate model to detect germline SNPs and somatic SNVs with high-throughput sequencing | -
dc.type | PG_Thesis | -
dc.identifier.hkul | b5185953 | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Biochemistry | -
dc.description.nature | published_or_final_version | -
dc.identifier.doi | 10.5353/th_b5185953 | -
