postgraduate thesis: Functional annotation, prioritization and enrichment analysis of human regulatory variants
Title | Functional annotation, prioritization and enrichment analysis of human regulatory variants |
---|---|
Authors | Yao, Hongcheng (姚宏成) |
Advisors | Sham, PC; Xia, Z |
Issue Date | 2021 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Yao, H. [姚宏成]. (2021). Functional annotation, prioritization and enrichment analysis of human regulatory variants. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Regulatory variants are vital for the regulation of gene expression and are involved in disease pathogenesis and trait development. In the past decade, genome-wide association studies (GWASs) and expression quantitative trait loci (eQTL) studies have identified numerous associated variants, a large proportion of which are located in noncoding regions, indicating their potential role as regulatory variants. However, precisely identifying and interpreting noncoding regulatory variants by experimental validation is costly and labor-intensive, which hampers the elucidation of the mechanisms underlying diseases and traits. Fortunately, functional annotations such as histone modification profiles can indicate the existence and potential function of regulatory variants, and recent years have seen a surge of genomic, transcriptomic and epigenomic profiling studies across diverse tissues/cell types. As a result, variant annotation has become a key step in the analysis of regulatory variants. Furthermore, computational methods based on these functional annotations have been developed to perform in silico prediction and prioritization of regulatory variants, while enrichment analysis is applied to a set of variants to determine which annotations are informative. In this thesis, we developed two computational methods to facilitate the prediction and enrichment analysis of regulatory variants.
We first used eQTL data from the Genotype-Tissue Expression (GTEx) project as training data, and comprehensively integrated tissue/cell type-specific epigenomic marks and prediction scores from existing tools as predictors, to develop a regulatory variant prediction method, cepip2. It was built on gradient tree boosting and consists of three submodels for different scenarios: a context-dependent model, an organism-level model and an overall model. Critical questions regarding model construction were carefully discussed, and the constructed models were able to predict regulatory potential accurately in a tissue/cell type-specific manner. Systematic comparisons among the submodels and with existing methods were carried out on multiple independent test datasets, and cepip2 demonstrated superior performance most of the time. As an illustrative application, cepip2 was applied to fine-mapped GWAS summary data for 39 traits/diseases and was shown to be capable of identifying the most relevant tissues/cell types.
Random sampling of variants matched on selected properties is commonly used to construct null distributions in enrichment analysis and to generate negative datasets in regulatory variant prediction. However, current tools are inefficient and unable to process large-scale input data. To tackle this problem, this thesis proposed novel designs, including a data structure with a corresponding index system and a sampling pipeline with a temporary storage algorithm, to develop a fast annotation-based matched variant sampling tool, vSampler. In careful benchmark tests, vSampler was shown to be much faster than existing tools and robust to massive amounts of input data. Its applications in enrichment analysis and the advantage of its comprehensive matching properties were demonstrated in three usage examples.
In conclusion, two novel computational methods were developed and evaluated in this thesis for the functional annotation, prioritization and enrichment analysis of human regulatory variants. We believe these methods will facilitate the precise interpretation of regulatory variants and their role in the development of complex traits. |
Degree | Doctor of Philosophy |
Subject | Human genetics - Variation; Genomics - Data processing |
Dept/Program | Biomedical Sciences |
Persistent Identifier | http://hdl.handle.net/10722/301058 |
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Sham, PC | - |
dc.contributor.advisor | Xia, Z | - |
dc.contributor.author | Yao, Hongcheng | - |
dc.contributor.author | 姚宏成 | - |
dc.date.accessioned | 2021-07-16T14:38:44Z | - |
dc.date.available | 2021-07-16T14:38:44Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | Yao, H. [姚宏成]. (2021). Functional annotation, prioritization and enrichment analysis of human regulatory variants. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/301058 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Human genetics - Variation | - |
dc.subject.lcsh | Genomics - Data processing | - |
dc.title | Functional annotation, prioritization and enrichment analysis of human regulatory variants | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Biomedical Sciences | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2021 | - |
dc.identifier.mmsid | 991044390191203414 | - |