postgraduate thesis: Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration

Title: Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration
Authors: Yu, Guangchuang (余光创)
Issue Date: 2017
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yu, G. [余光创]. (2017). Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: With the advance of next-generation sequencing (NGS) technologies, the volume of genetic data produced for research is booming. Molecular evolution, the discipline that studies genetic changes over time and among organisms, has grown alongside the rise of such ‘big data’. In infectious disease research, studying the molecular evolution of pathogens plays an important role in investigating disease origin, transmission and evolution in outbreaks, because the infection and transmission processes leave footprints on the pathogens’ genomes. While NGS technologies have enabled fast, massive acquisition of pathogen genome sequences, such big data present many computational challenges to achieving evolutionary analyses with high efficiency and precision. This thesis aims to provide new computational methods and tools to address some of these challenges.

The phylogenetic tree (or phylogeny) is a fundamental framework for many analysis methods that study different aspects and statistics of molecular evolution, such as molecular clock and selective pressure inference. Comparing the results of different analyses with one another, and with phenotype data about the studied organisms obtained from experiments or other investigations, will likely generate a more comprehensive understanding of the organisms and new hypotheses of genotype-phenotype association. A programmable platform for such data integration and analysis is needed for large data sets. Here, an R package, treeio, was developed to robustly import phylogeny-related data from various analysis programs and sources. Another R package, ggtree, was developed to integrate these imported data for high-level analysis and efficient annotation of large, complex phylogenetic trees.

As genetic sequences accumulate ever faster with NGS technologies, rebuilding large phylogenetic trees from scratch to include new sequences has become inefficient, because the evolutionary relationships of pre-existing sequences are repeatedly recalculated. TIPars was proposed to efficiently insert a new sequence into an existing tree using the maximum parsimony criterion with pre-computed ancestral sequences. Simulation studies showed that TIPars generally had higher accuracy and speed than existing maximum likelihood methods such as pplacer and EPA.

Most popular NGS technologies generate short sequencing reads that must be assembled into complete or longer biological sequences for downstream molecular evolution analyses. Conventional assembly methods are limited in handling reads from samples containing multiple strains of an organism, which are commonly observed in surveillance of pathogens such as avian influenza A virus (AIV). This issue was addressed by a novel method (denoted PAM) that uses phylogeny to guide genome assembly. PAM was shown to be capable of distinguishing short sequencing reads from closely related pathogens, such as co-infecting AIVs, and hence of assembling genome sequences with improved accuracy and coverage compared with existing methods.

This thesis developed several computational methods and tools to address issues in studying molecular evolution in the big-data era, including genome assembly with strain-level resolution, updating a large phylogeny with new sequences, and data integration, analysis and annotation on large phylogenies. It is anticipated that these methods will facilitate the ‘genomic surveillance’ of viral pathogens, which involves joint analysis of large amounts of genetic sequence data with related epidemiological and virological data.
Degree: Doctor of Philosophy
Subject: Molecular evolution; Genomics; Big data
Dept/Program: Public Health
Persistent Identifier: http://hdl.handle.net/10722/261546

 

DC Field | Value | Language
dc.contributor.author | Yu, Guangchuang | -
dc.contributor.author | 余光创 | -
dc.date.accessioned | 2018-09-20T06:44:12Z | -
dc.date.available | 2018-09-20T06:44:12Z | -
dc.date.issued | 2017 | -
dc.identifier.citation | Yu, G. [余光创]. (2017). Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/261546 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Molecular evolution | -
dc.subject.lcsh | Genomics | -
dc.subject.lcsh | Big data | -
dc.title | Big-data computational methods for studying molecular evolution : from accurate genome assembly to phylogenetic analysis and data integration | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Public Health | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2017 | -
dc.identifier.mmsid | 991044040577403414 | -
