A Biomarker-Driven and Interpretable Machine Learning Model for Diagnosing Diabetes Mellitus

Xiao, Zhihui; Wang, Mingfu; Zhao, Yueliang; Wang, Hui

Publisher Website: 10.1002/fsn3.70234
Scopus: eid_2-s2.0-105004209467

Supplementary

Citations:
- Scopus: 0
Appears in Collections:
- Biological Sciences: Journal/Magazine Articles

Title	A Biomarker-Driven and Interpretable Machine Learning Model for Diagnosing Diabetes Mellitus
Authors	Xiao, Zhihui; Wang, Mingfu; Zhao, Yueliang; Wang, Hui
Keywords	biomarker-driven; diabetes mellitus; interpretable; machine learning; prediction model
Issue Date	30-Apr-2025
Publisher	Wiley Open Access
Citation	Food Science & Nutrition, 2025, v. 13, n. 5. DOI: http://dx.doi.org/10.1002/fsn3.70234
Abstract	Diabetes is one of the leading causes of death and disability worldwide. Developing earlier and more accurate diagnostic methods is crucial for the clinical prevention and treatment of diabetes. Here, data on biochemical indicators and physiological characteristics of 4335 participants were collected from the National Health and Nutrition Examination Survey (NHANES) database for 2017 to 2020. After data preprocessing, the dataset was randomly divided into a training set (70%) and a test set (30%), and the Boruta algorithm was used to screen feature indicators on the training set. Next, three machine learning algorithms, Random Forest (RF), Multi-Layer Perceptron (MLP), and Extreme Gradient Boosting (XGBoost), were used to build predictive models through 10-fold cross-validation on the training set, followed by performance evaluation on the test set. The RF model exhibited the best performance, with an area under the curve (AUC) of 0.958 (95% CI: 0.943–0.973), a recall of 0.897, a specificity of 0.916, an F1 score of 0.747, and an overall accuracy of 0.913. Moreover, SHapley Additive exPlanations (SHAP) and Partial Dependency Plots (PDP) were applied to interpret the RF model and analyze the risk factors for diabetes. Glycohemoglobin, glucose, fasting glucose, age, cholesterol, osmolality, BMI, blood urea nitrogen, and insulin were found to exert the greatest influence on the prevalence of diabetes. Collectively, the RF model has considerable application prospects for the diagnosis of diabetes and can serve as a valuable supplementary tool for clinical diagnosis and risk assessment in diabetes.
Persistent Identifier	http://hdl.handle.net/10722/367318
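
The test-set metrics quoted in the abstract (recall, specificity, F1 score, accuracy) all derive from the four cells of a binary confusion matrix. The sketch below is not the authors' code. It shows the standard definitions and uses the identity F1 = 2PR / (P + R) to back out the precision implied by the reported recall (0.897) and F1 score (0.747). The function names and the confusion-matrix counts are hypothetical, chosen only for illustration, and the implied precision is an arithmetic consequence of rounded figures, not a value stated in the record.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)            # sensitivity: share of true cases detected
    specificity = tn / (tn + fp)       # share of non-cases correctly rejected
    precision = tp / (tp + fp)         # share of positive calls that are true cases
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


def precision_from_f1(f1, recall):
    """Invert F1 = 2PR / (P + R) to recover precision P from F1 and recall R."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 and recall are inconsistent: 2 * recall must exceed F1")
    return f1 * recall / denominator


# Hypothetical counts (not from the paper) for a test set of 1000 people.
example = metrics_from_counts(tp=90, fp=50, tn=850, fn=10)

# Precision implied by the recall and F1 score reported in the abstract.
implied_precision = precision_from_f1(f1=0.747, recall=0.897)

if __name__ == "__main__":
    for name, value in example.items():
        print(f"{name}: {value:.3f}")
    print(f"implied precision: {implied_precision:.3f}")
```

Under these assumptions the implied precision is roughly 0.64, noticeably lower than the recall. That pattern is typical when the positive class is a minority of the test set, which is why the F1 score sits well below the accuracy.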

DC Field	Value	Language
dc.contributor.author	Xiao, Zhihui	-
dc.contributor.author	Wang, Mingfu	-
dc.contributor.author	Zhao, Yueliang	-
dc.contributor.author	Wang, Hui	-
dc.date.accessioned	2025-12-10T08:06:31Z	-
dc.date.available	2025-12-10T08:06:31Z	-
dc.date.issued	2025-04-30	-
dc.identifier.citation	Food Science & Nutrition, 2025, v. 13, n. 5	-
dc.identifier.uri	http://hdl.handle.net/10722/367318	-
dc.language	eng	-
dc.publisher	Wiley Open Access	-
dc.relation.ispartof	Food Science & Nutrition	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject	biomarker-driven	-
dc.subject	diabetes mellitus	-
dc.subject	interpretable	-
dc.subject	machine learning	-
dc.subject	prediction model	-
dc.title	A Biomarker-Driven and Interpretable Machine Learning Model for Diagnosing Diabetes Mellitus	-
dc.type	Article	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.1002/fsn3.70234	-
dc.identifier.scopus	eid_2-s2.0-105004209467	-
dc.identifier.volume	13	-
dc.identifier.issue	5	-
dc.identifier.eissn	2048-7177	-
dc.identifier.issnl	2048-7177	-
