
Postgraduate thesis: Random forest boosts genetic risk prediction of systemic lupus erythematosus (SLE) but does not distinguish between patients with lupus nephritis (LN) and non-LN

Title: Random forest boosts genetic risk prediction of systemic lupus erythematosus (SLE) but does not distinguish between patients with lupus nephritis (LN) and non-LN
Authors: Ma, Wen (马文)
Advisor(s): Lau, YL; Yang, W
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation
Ma, W. [马文]. (2023). Random forest boosts genetic risk prediction of systemic lupus erythematosus (SLE) but does not distinguish between patients with lupus nephritis (LN) and non-LN. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Systemic lupus erythematosus (SLE) is a common autoimmune disease that affects several vital organs and tissues, including the heart, brain, kidneys, joints, skin, and central nervous system. Because SLE is heterogeneous, it is difficult to make a prognosis or an early diagnosis based solely on biomarker tests such as the anti-nuclear antibody and anti-dsDNA tests, which are not specific to SLE. With the increased availability of genomic single nucleotide polymorphism (SNP) genotyping for SLE, such genetic information can aid early diagnosis and prediction of the risk of developing SLE. Advances in precision medicine can also aid in assessing an individual's risk of SLE. The typical method for predicting disease risk from genotype data, the polygenic risk score (PRS), often predicts poorly because it does not take into account the relationships between alleles. Hence, we proposed applying three classical supervised machine learning (ML) models, Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Network (ANN), to capture genetic correlations and improve prediction of the risk of developing SLE. Among the three ML models, RF was the most efficient, with the shortest training time, and it outperformed lasso-sum PRS, one of the best PRS models, in predicting SLE. Specifically, RF produced the highest mean prediction AUC, 0.84 on the Chinese dataset and 0.76 on the European dataset, improvements of 0.10 and 0.11 over lasso-sum PRS, respectively. The SVM and ANN models performed comparably, with mean AUC values of 0.77 and 0.76 on the Chinese dataset, respectively, slightly higher than that of the PRS model (mean AUC = 0.74). A similar pattern was found in the European dataset. Approximately 50%-70% of SLE patients develop lupus nephritis (LN), which has the highest mortality rate among these patients.
Very few predictive models are available to aid the early diagnosis of LN specifically. To fill this gap, we compared the predictive power of RF, SVM, and ANN with that of lasso-sum PRS. Using the Hong Kong data, we performed predictions on two sample sets: 1) LN and non-LN (NLN) samples only, and 2) LN and NLN samples together with control samples. With LN and NLN samples only, none of the four models distinguished LN from NLN well; the best mean AUC, 0.55, was achieved by ANN. Furthermore, adding control samples did not significantly improve the models' ability to distinguish LN from NLN. Nevertheless, in the three-class classification, RF had the best mean AUC, 0.89, for differentiating control from LN samples, an improvement of 0.12 over lasso-sum PRS (mean AUC = 0.77).
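As an illustration of the two quantities the abstract relies on, the following is a minimal sketch and not the thesis's pipeline. It shows a generic additive risk score (a weighted sum of risk-allele counts, which by construction ignores relationships between alleles) and the AUC used to compare models. Everything in it is an assumption made for illustration: the function names, the simulated genotypes, the effect weights, and the sample sizes are hypothetical, and it does not implement lasso-sum PRS, RF, SVM, or ANN.

```python
import random


def polygenic_risk_score(genotype, weights):
    """Additive score: sum over SNPs of weight times risk-allele count (0, 1, or 2)."""
    return sum(w * g for w, g in zip(weights, genotype))


def auc(scores, labels):
    """AUC as the fraction of case/control pairs in which the case scores higher.

    Ties count as half a win. Labels are 1 for cases and 0 for controls.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def simulate(n_samples=400, n_snps=20, seed=0):
    """Simulate genotypes and case/control labels (purely hypothetical data)."""
    rng = random.Random(seed)
    # Hypothetical per-SNP effect sizes; real weights would come from
    # association statistics, not from a random draw.
    weights = [rng.uniform(0.0, 0.5) for _ in range(n_snps)]
    genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_snps)]
                 for _ in range(n_samples)]
    # Liability = additive genetic score plus Gaussian noise.
    liabilities = [polygenic_risk_score(g, weights) + rng.gauss(0.0, 1.0)
                   for g in genotypes]
    # Samples whose liability exceeds the median are labelled as cases.
    threshold = sorted(liabilities)[n_samples // 2]
    labels = [1 if value > threshold else 0 for value in liabilities]
    return genotypes, labels, weights


genotypes, labels, weights = simulate()
scores = [polygenic_risk_score(g, weights) for g in genotypes]
print(f"AUC of the additive score on simulated data: {auc(scores, labels):.2f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls, which is the scale on which the mean AUC values in the abstract are reported.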
Degree: Doctor of Philosophy
Subject: Machine learning; Systemic lupus erythematosus
Dept/Program: Paediatrics and Adolescent Medicine
Persistent Identifier: http://hdl.handle.net/10722/335957

 

DC Field | Value | Language
dc.contributor.advisor | Lau, YL | -
dc.contributor.advisor | Yang, W | -
dc.contributor.author | Ma, Wen | -
dc.contributor.author | 马文 | -
dc.date.accessioned | 2023-12-29T04:05:10Z | -
dc.date.available | 2023-12-29T04:05:10Z | -
dc.date.issued | 2023 | -
dc.identifier.citation | Ma, W. [马文]. (2023). Random forest boosts genetic risk prediction of systemic lupus erythematosus (SLE) but does not distinguish between patients with lupus nephritis (LN) and non-LN. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/335957 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Machine learning | -
dc.subject.lcsh | Systemic lupus erythematosus | -
dc.title | Random forest boosts genetic risk prediction of systemic lupus erythematosus (SLE) but does not distinguish between patients with lupus nephritis (LN) and non-LN | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Paediatrics and Adolescent Medicine | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2024 | -
dc.identifier.mmsid | 991044751042003414 | -
