Links for fulltext
- Publisher Website: 10.1002/fsn3.70234
- Scopus: eid_2-s2.0-105004209467
Article: A Biomarker-Driven and Interpretable Machine Learning Model for Diagnosing Diabetes Mellitus
| Title | A Biomarker-Driven and Interpretable Machine Learning Model for Diagnosing Diabetes Mellitus |
|---|---|
| Authors | Xiao, Zhihui; Wang, Mingfu; Zhao, Yueliang; Wang, Hui |
| Keywords | biomarker-driven; diabetes mellitus; interpretable; machine learning; prediction model |
| Issue Date | 30-Apr-2025 |
| Publisher | Wiley Open Access |
| Citation | Food Science & Nutrition, 2025, v. 13, n. 5 |
| Abstract | Diabetes is one of the leading causes of death and disability worldwide. Developing earlier and more accurate diagnostic methods is crucial for the clinical prevention and treatment of diabetes. Here, data on biochemical indicators and physiological characteristics of 4335 participants were collected from the National Health and Nutrition Examination Survey (NHANES) database for 2017 to 2020. After data preprocessing, the dataset was randomly divided into a training set (70%) and a test set (30%), and the Boruta algorithm was then used to screen feature indicators on the training set. Next, three machine learning algorithms, Random Forest (RF), Multi-Layer Perceptron (MLP), and Extreme Gradient Boosting (XGBoost), were employed to build predictive models through 10-fold cross-validation on the training set, followed by performance evaluation on the test set. The RF model exhibited the best performance, with an area under the curve (AUC) of 0.958 (95% CI: 0.943–0.973), a recall of 0.897, a specificity of 0.916, an F1 score of 0.747, and an overall accuracy of 0.913. Moreover, SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDP) were applied to interpret the RF model and analyze the risk factors for diabetes. Glycohemoglobin, glucose, fasting glucose, age, cholesterol, osmolality, BMI, blood urea nitrogen, and insulin were found to exert the greatest influence on the prevalence of diabetes. Collectively, the RF model shows considerable promise for diagnosing diabetes and can serve as a valuable supplementary tool for clinical diagnosis and risk assessment. |
| Persistent Identifier | http://hdl.handle.net/10722/367318 |
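
The abstract describes a random 70%/30% train/test split followed by 10-fold cross-validation on the training set. The following is a minimal sketch of that index bookkeeping only, using the Python standard library. The function names `split_indices` and `k_fold`, the seed, and the rounding of the test-set size are assumptions made for illustration; the record does not state how the authors implemented the split, and no model is trained here.

```python
import random


def split_indices(n_samples, test_fraction=0.30, seed=0):
    """Shuffle 0..n_samples-1 and return (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumed)
    n_test = int(n_samples * test_fraction)  # truncation is an assumption
    return indices[n_test:], indices[:n_test]


def k_fold(train_indices, k=10):
    """Yield (fit, validate) index lists, one pair per fold."""
    # Every k-th index goes to the same fold, so fold sizes differ by at most 1.
    folds = [train_indices[i::k] for i in range(k)]
    for i, validate in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate


# 4335 is the participant count reported in the abstract.
train, test = split_indices(4335)
folds = list(k_fold(train, k=10))
print(len(train), len(test), len(folds))
```

Each sample in the training set is used for validation exactly once across the ten folds, and the test indices are never touched during cross-validation, which mirrors the separation between model building and final evaluation described in the abstract.
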
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Xiao, Zhihui | - |
| dc.contributor.author | Wang, Mingfu | - |
| dc.contributor.author | Zhao, Yueliang | - |
| dc.contributor.author | Wang, Hui | - |
| dc.date.accessioned | 2025-12-10T08:06:31Z | - |
| dc.date.available | 2025-12-10T08:06:31Z | - |
| dc.date.issued | 2025-04-30 | - |
| dc.identifier.citation | Food Science & Nutrition, 2025, v. 13, n. 5 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/367318 | - |
| dc.description.abstract | Diabetes is one of the leading causes of death and disability worldwide. Developing earlier and more accurate diagnostic methods is crucial for the clinical prevention and treatment of diabetes. Here, data on biochemical indicators and physiological characteristics of 4335 participants were collected from the National Health and Nutrition Examination Survey (NHANES) database for 2017 to 2020. After data preprocessing, the dataset was randomly divided into a training set (70%) and a test set (30%), and the Boruta algorithm was then used to screen feature indicators on the training set. Next, three machine learning algorithms, Random Forest (RF), Multi-Layer Perceptron (MLP), and Extreme Gradient Boosting (XGBoost), were employed to build predictive models through 10-fold cross-validation on the training set, followed by performance evaluation on the test set. The RF model exhibited the best performance, with an area under the curve (AUC) of 0.958 (95% CI: 0.943–0.973), a recall of 0.897, a specificity of 0.916, an F1 score of 0.747, and an overall accuracy of 0.913. Moreover, SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDP) were applied to interpret the RF model and analyze the risk factors for diabetes. Glycohemoglobin, glucose, fasting glucose, age, cholesterol, osmolality, BMI, blood urea nitrogen, and insulin were found to exert the greatest influence on the prevalence of diabetes. Collectively, the RF model shows considerable promise for diagnosing diabetes and can serve as a valuable supplementary tool for clinical diagnosis and risk assessment. | - |
| dc.language | eng | - |
| dc.publisher | Wiley Open Access | - |
| dc.relation.ispartof | Food Science & Nutrition | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | biomarker-driven | - |
| dc.subject | diabetes mellitus | - |
| dc.subject | interpretable | - |
| dc.subject | machine learning | - |
| dc.subject | prediction model | - |
| dc.title | A Biomarker-Driven and Interpretable Machine Learning Model for Diagnosing Diabetes Mellitus | - |
| dc.type | Article | - |
| dc.description.nature | published_or_final_version | - |
| dc.identifier.doi | 10.1002/fsn3.70234 | - |
| dc.identifier.scopus | eid_2-s2.0-105004209467 | - |
| dc.identifier.volume | 13 | - |
| dc.identifier.issue | 5 | - |
| dc.identifier.eissn | 2048-7177 | - |
| dc.identifier.issnl | 2048-7177 | - |
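
The abstract reports recall, specificity, F1 score, and accuracy for the RF model. As a hedged illustration of how these metrics relate to one another, the sketch below computes them from confusion-matrix counts. The counts are hypothetical and are not taken from the study; the helper names `classification_metrics` and `precision_from_f1` are likewise illustrative. The second helper rearranges the standard F1 definition to show the precision implied by a given F1 and recall; applied to the rounded values in the abstract, it is an approximation only, since the record does not report precision.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # sensitivity: share of positives detected
    specificity = tn / (tn + fp)   # share of negatives correctly rejected
    precision = tp / (tp + fp)     # share of positive calls that are correct
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    return f1 * recall / (2 * recall - f1)


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=40, tn=860, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Precision implied by the rounded F1 (0.747) and recall (0.897) in the abstract.
implied_precision = precision_from_f1(0.747, 0.897)
print(f"implied precision: {implied_precision:.3f}")
```

Because F1 is the harmonic mean of precision and recall, an F1 score well below the recall indicates that precision is the lower of the two, which is common when the positive class is a minority of the test set.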
