Statistical and machine learning methods and algorithms for analyzing data from omics technologies

Yan, Kang; 嚴康

Appears in Collections:
- HKU Theses Online
- Public Health: Theses

postgraduate thesis: Statistical and machine learning methods and algorithms for analyzing data from omics technologies

Title	Statistical and machine learning methods and algorithms for analyzing data from omics technologies
Authors	Yan, Kang 嚴康
Advisors	Pang, HMH; Leung, GM; Wu, JTK
Issue Date	2020
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Yan, K. [嚴康]. (2020). Statistical and machine learning methods and algorithms for analyzing data from omics technologies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	We are now in an era of massive high-throughput omics data. Understanding the structure and characteristics underlying various omics data, as well as choosing appropriate analytical methods, is crucial to the correct interpretation of the underlying biological and disease mechanisms. However, numerous human omics studies use aging algorithms that do not fully utilize the potential of omics data. Hence, there is a high demand for computational tools that can be applied to the storage, processing, analysis, and interpretation of various omics data. More specifically, it is imperative to develop well-designed computational, mathematical, statistical, and machine learning approaches that enable improved analysis and better interpretation of omics data for human studies. To address these needs, this research focused on analytical algorithms for genetic and phenotypic trait association analysis with data generated from different omics technologies. Recent developments in omics technologies have offered an unprecedented opportunity to understand human health and complex diseases through the use of different omics features. Integrating diverse data sources can facilitate thorough and extensive analysis of complex phenotypic traits by discovering patterns that are clearly observed across different experiments. Therefore, this thesis first provided a comprehensive comparison and evaluation of graph- and kernel-based omics integration classification algorithms, taking into account various classification performance metrics as well as computation time. The empirical evaluation on hypertension, breast cancer, and ovarian cancer data sets suggested that the better performers were the composite association network, the relevance vector machine, and the AdaBoost relevance vector machine.
Biomedical imaging, a powerful technique for visualizing biological activities and structures, is generally less invasive than some existing clinical examinations and inspections for the diagnosis and prognosis of diseases. Numerous radiomics features derived from biomedical imaging can be used to determine diseases and predict therapeutic responses. Accordingly, a novel machine learning analytical framework that better utilizes the high-dimensional character of radiomics features extracted from biomedical imaging for right-censored survival outcomes is presented. Expression quantitative trait loci (eQTL) analysis involves the discovery of genetic variants associated with gene expression, revealing their role in its regulation. This thesis presented a penalty-based multivariable regression model for the simultaneous discovery of multiple phenotypic trait-associated genetic variants while accounting for non-genetic and genetic confounding, and a Bayesian hierarchical framework. This framework utilizes summary statistics to jointly identify the credible set of true eQTLs across multiple tissues while modeling the linkage disequilibrium structure of genetic variants and corresponding epigenetic annotations. Experiments with simulated scenarios, imaging data from non-small cell lung cancer and head and neck cancer, and eQTL data retrieved from the Genotype-Tissue Expression (GTEx) consortium demonstrated the improved performance of the three proposed algorithms over some existing methodologies. In general, the thesis contributes to the development and implementation of statistical and machine learning approaches for analyzing various omics data types with genetic, phenotypic, and survival traits.
Degree	Doctor of Philosophy
Subject	Computational biology; Genomics - Statistical methods; Meta-analysis
Dept/Program	Public Health
Persistent Identifier	http://hdl.handle.net/10722/301048

DC Field	Value	Language
dc.contributor.advisor	Pang, HMH	-
dc.contributor.advisor	Leung, GM	-
dc.contributor.advisor	Wu, JTK	-
dc.contributor.author	Yan, Kang	-
dc.contributor.author	嚴康	-
dc.date.accessioned	2021-07-16T14:38:43Z	-
dc.date.available	2021-07-16T14:38:43Z	-
dc.date.issued	2020	-
dc.identifier.citation	Yan, K. [嚴康]. (2020). Statistical and machine learning methods and algorithms for analyzing data from omics technologies. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/301048	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Computational biology	-
dc.subject.lcsh	Genomics - Statistical methods	-
dc.subject.lcsh	Meta-analysis	-
dc.title	Statistical and machine learning methods and algorithms for analyzing data from omics technologies	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Public Health	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2020	-
dc.identifier.mmsid	991044284192303414	-
