postgraduate thesis: Development and validation of explainable machine-learning prediction systems : a study of biomedical and clinical data

Title: Development and validation of explainable machine-learning prediction systems : a study of biomedical and clinical data
Authors: Ng, Yui Lun (吳鋭麟)
Advisor(s): Kwok, KW
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Ng, Y. L. [吳鋭麟]. (2024). Development and validation of explainable machine-learning prediction systems : a study of biomedical and clinical data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Recent years have witnessed the rapid development and growing popularity of machine learning (ML) algorithms and explainability methods. ML is renowned for its exceptional ability to capture key features and patterns in high-dimensional data, which makes it well suited to biomedical and clinical applications. These advantages have led to the emergence of explainable ML methods for predicting disease risk, estimating patient readmission likelihood, and forecasting care needs. However, recent studies have focused primarily on predictive accuracy; it is essential to develop ML frameworks that are not only accurate but also interpretable and transparent. The main focus of this thesis is to propose a generic workflow that integrates the procedures essential for developing explainable ML systems: (i) the categorization of data types, (ii) the selection of appropriate ML algorithms, (iii) the choice of evaluation metrics, and (iv) the utilization of explainability methods. Categorizing data types enables a comprehensive understanding of the characteristics inherent in a dataset, which facilitates assessing the suitability of ML algorithms and identifying any necessary data preprocessing steps. The choice of ML algorithm can substantially affect performance. Evaluation metrics provide quantitative measures for comparing different algorithms or settings. Explainability methods generate interpretable explanations that reveal the factors and features contributing to a model's predictions or decisions.

The first part of this thesis studied this workflow for structured electronic health record data. Data from patients with Clostridioides difficile infection at risk of mortality or recurrence were used to develop an open-access, web-based prediction system for estimating their outcomes. Prognostic models, comprising four types of ML algorithms and statistical logistic regression models, were developed and compared to determine the optimal ML algorithm for this type of data. Explainability methods were employed to identify the features most important to the ML models and to relate them to clinical findings.

The second part of this thesis focused on the development of ML platforms for predicting enzyme function from protein structures. Protein structure data are mainly unstructured or semi-structured; such data can be modeled as graphs, and graph neural networks can be leveraged to extract relevant features. To pinpoint the catalytic amino acid residues related to enzyme function, several explainability methods were investigated and their effectiveness assessed. The proposed framework can be readily integrated with AlphaFold 2-predicted structures, yielding an end-to-end pipeline for deriving enzymatic functions and active sites from input protein sequences.

The last part of this thesis highlights future research directions and potential enhancements for the proposed techniques. In summary, this thesis studied the procedures crucial for developing explainable ML systems based on biomedical data.
Degree: Doctor of Philosophy
Subjects: Medical informatics; Machine learning
Dept/Program: Mechanical Engineering
Persistent Identifier: http://hdl.handle.net/10722/358266
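
The abstract describes a model-comparison workflow: train several ML classifiers alongside a statistical logistic regression baseline, score them with a common evaluation metric, and apply an explainability method to the result. Below is a minimal Python sketch of that pattern. It is not the thesis's actual pipeline: the synthetic dataset, the particular scikit-learn models, the AUROC metric, and the SHAP-based explanation step are all illustrative assumptions.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    import shap

    # Synthetic stand-in for structured EHR features; the thesis's C. difficile
    # cohort is not public, so this toy dataset is purely illustrative.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Candidate models: ML classifiers compared against a logistic regression baseline.
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(random_state=0),
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
    }

    # Evaluation metric: AUROC as one common quantitative yardstick.
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: AUROC = {scores[name]:.3f}")

    # Explainability step: model-agnostic SHAP values attribute the best model's
    # predictions to individual input features.
    best = models[max(scores, key=scores.get)]
    explainer = shap.Explainer(best.predict_proba, X_train)
    shap_values = explainer(X_test[:50])

The abstract does not name the explainability methods used for this part, so SHAP here stands in for whatever post-hoc feature-attribution method the thesis applied.

For the protein-structure part, one common way to obtain the graph representation the abstract mentions is a residue-contact graph: residues become nodes, and an edge links residues whose C-alpha atoms lie within a distance cutoff. The sketch below builds such an adjacency matrix; the 8 Å cutoff and the contact_graph helper are hypothetical choices for illustration, not the thesis's definition. A graph neural network would consume this adjacency together with per-residue node features.

    import numpy as np

    def contact_graph(ca_coords, cutoff=8.0):
        """Binary residue-contact adjacency from C-alpha coordinates.

        ca_coords: (n_residues, 3) array, e.g. parsed from an experimental PDB
        file or an AlphaFold 2-predicted structure. cutoff is in angstroms;
        8 A is a common (but here assumed) contact threshold.
        """
        dist = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
        adj = (dist < cutoff).astype(np.float32)  # 1 where residues are in contact
        np.fill_diagonal(adj, 0.0)                # drop self-loops
        return adj

    # Toy usage: random coordinates stand in for a real structure.
    coords = np.random.default_rng(0).normal(size=(50, 3)) * 10.0
    A = contact_graph(coords)
    print(A.shape, int(A.sum()) // 2, "contacts")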


DC Field / Value
dc.contributor.advisor: Kwok, KW
dc.contributor.author: Ng, Yui Lun
dc.contributor.author: 吳鋭麟
dc.date.accessioned: 2025-07-28T08:40:43Z
dc.date.available: 2025-07-28T08:40:43Z
dc.date.issued: 2024
dc.identifier.citation: Ng, Y. L. [吳鋭麟]. (2024). Development and validation of explainable machine-learning prediction systems : a study of biomedical and clinical data. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/358266
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Medical informatics
dc.subject.lcsh: Machine learning
dc.title: Development and validation of explainable machine-learning prediction systems : a study of biomedical and clinical data
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Mechanical Engineering
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044843668303414
