postgraduate thesis: Development of bioinformatic tools for enhanced prediction and variable selection in genetic studies
Title | Development of bioinformatic tools for enhanced prediction and variable selection in genetic studies |
---|---|
Authors | Tian, Peixin (田培馨) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Tian, P. [田培馨]. (2024). Development of bioinformatic tools for enhanced prediction and variable selection in genetic studies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | The thesis encompasses three sections focused on addressing two common topics in genetic studies: variable selection and prediction.
Identifying candidate genes associated with complex traits is of scientific interest in genetic studies because it helps explain the underlying biological mechanisms. Public genomic data are now abundant, and the number of genes collected is large. Traditional methods usually guarantee false discovery rate (FDR) control only asymptotically, yet genetic studies often involve thousands of genes while only a limited number of samples are available. To address this issue and make full use of public resources, we introduce a statistical tool named knockoff inference, which generates fake copies of the original variables and builds a data-dependent threshold aimed at controlling the FDR in finite-sample settings. The original variables are regarded as random and are not tied to any specific model. In this thesis, we propose a novel method named Grace-AKO, which integrates gene-gene interaction networks as graphical structures and adopts the concept of knockoff inference to achieve finite-sample FDR control. Graph-constrained estimation (Grace) offers novel biological insight into variable selection for genomic data by leveraging gene-gene interaction networks. The performance of Grace-AKO is demonstrated by simulation studies and by real data analysis of prostate-specific antigen (PSA) level data from The Cancer Genome Atlas (TCGA), with a graphical structure from the Kyoto Encyclopedia of Genes and Genomes (KEGG).
Moreover, in high-dimensional mediation analysis, identifying mediators plays an important role in elucidating the causal pathway from exposure variables to response variables. Specifically, DNA methylation CpG sites often serve as mediators, potentially playing an intermediate role in the pathway from smoking status to lung cancer survival outcomes. Traditional high-dimensional mediation methods for survival outcomes have limited ability to ensure FDR control in finite samples. In this thesis, we introduce a statistical method termed CoxMKF, which extends multiple knockoff inference to high-dimensional mediation analysis with an epigenetic survival outcome. CoxMKF introduces two novel feature statistics to expand the applicability of knockoff inference to survival models. The performance of CoxMKF is extensively examined through simulation studies and real data analysis, particularly focusing on lung cancer survival data from TCGA.
Massive and diverse datasets are now available for various purposes, and a primary objective is to integrate this information fully to improve prediction accuracy. Notably, there is a wealth of publicly available genome-wide association study (GWAS) data for the European population, while data for non-European populations are relatively limited. Traditional genetic risk models often underperform when trained solely on non-European population data. To address this challenge, this thesis introduces a novel statistical method called TL-Multi. TL-Multi is inspired by transfer learning and seeks to extract knowledge from informative populations to correct prediction biases in target populations. The superior performance of TL-Multi is demonstrated through simulations conducted across a wide range of genetic correlations between informative and target populations. In the real data analysis section, TL-Multi is applied to various complex traits and shows higher predictive accuracy than single-population methods. |
Degree | Doctor of Philosophy |
Subject | Genetics - Statistical methods; Bioinformatics |
Dept/Program | Statistics and Actuarial Science |
Persistent Identifier | http://hdl.handle.net/10722/353380 |
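
The abstract describes knockoff inference as building a data-dependent threshold aimed at controlling the FDR. As a rough illustration only, the sketch below implements the standard knockoff+ threshold from the general knockoff literature. It is not taken from the thesis and is not necessarily the procedure used by Grace-AKO or CoxMKF. The function names `knockoff_plus_threshold` and `knockoff_select` are hypothetical, and the feature statistics `W` are assumed to have been computed already, with a large positive value suggesting that an original variable is more important than its knockoff copy.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns np.inf when no candidate satisfies the bound."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes, in ascending order.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Negative statistics beyond -t act as an estimate of false positives.
        est_false = 1 + np.sum(W <= -t)
        n_selected = max(1, np.sum(W >= t))
        if est_false / n_selected <= q:
            return float(t)
    return np.inf

def knockoff_select(W, q):
    """Indices of variables whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(W, q)
    return np.flatnonzero(np.asarray(W, dtype=float) >= t)

# Illustrative statistics only (made-up values, not results from the thesis).
W_example = [5, 4, 3, 2.5, -1, 2, 0.5, -0.3, 6, 7]
selected = knockoff_select(W_example, q=0.2)
```

With these made-up statistics and a target FDR level of 0.2, the threshold is 2, so the variables with statistics of at least 2 are selected. When no candidate threshold meets the bound, the sketch returns an infinite threshold and selects nothing.
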
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Tian, Peixin | - |
dc.contributor.author | 田培馨 | - |
dc.date.accessioned | 2025-01-17T09:46:11Z | - |
dc.date.available | 2025-01-17T09:46:11Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Tian, P. [田培馨]. (2024). Development of bioinformatic tools for enhanced prediction and variable selection in genetic studies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/353380 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Genetics - Statistical methods | - |
dc.subject.lcsh | Bioinformatics | - |
dc.title | Development of bioinformatic tools for enhanced prediction and variable selection in genetic studies | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Statistics and Actuarial Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2025 | - |
dc.identifier.mmsid | 991044897477203414 | - |