Generative adversarial networks for imputing missing data for big data clinical research

DONG, W; Fong, DYT; Yoon, JS; Wan, YFE; Bedford, LE; Tang, HM; Lam, CLK

Publisher Website: 10.1186/s12874-021-01272-3
WOS: WOS:000642555800002

Appears in Collections:
- Family Medicine and Primary Care: Journal/Magazine Articles
- Nursing Studies: Journal/Magazine Articles

Title	Generative adversarial networks for imputing missing data for big data clinical research
Authors	DONG, W; Fong, DYT; Yoon, JS; Wan, YFE; Bedford, LE; Tang, HM; Lam, CLK
Issue Date	2021
Citation	BMC Medical Research Methodology, 2021, v. 21. DOI: http://dx.doi.org/10.1186/s12874-021-01272-3
Abstract	Background: Missing data is a pervasive problem in clinical research. Generative adversarial imputation nets (GAIN), a novel machine learning approach to data imputation, has the potential to substitute missing data accurately and efficiently but has not yet been evaluated in large empirical clinical datasets. Objectives: This study aimed to evaluate the accuracy of GAIN in imputing missing values in large real-world clinical datasets with mixed-type variables. The computational efficiency of GAIN was also evaluated. The performance of GAIN was compared with that of two commonly used methods, MICE and missForest. Methods: Two real-world clinical datasets were used. The first came from a cohort study on the long-term outcomes of patients with diabetes (50,000 complete cases), and the second from a cohort study on the effectiveness of a risk assessment and management programme for patients with hypertension (10,000 complete cases). Missing data (missing at random) in independent variables were simulated at two missingness rates (20% and 50%). Imputation accuracy was measured by the normalized root mean square error (NRMSE) between imputed and real values for continuous variables and by the proportion of falsely classified (PFC) entries for categorical variables. Computation time per imputation was recorded for each method. Differences in accuracy between imputation methods were compared using ANOVA or a non-parametric test. Results: Both missForest and GAIN were more accurate than MICE. GAIN showed similar accuracy to missForest when the simulated missingness rate was 20%, but was more accurate when the simulated missingness rate was 50%. GAIN was the most accurate method for imputing skewed continuous and imbalanced categorical variables at both missingness rates. GAIN was much faster (32 min on a PC) than missForest (1300 min) when the sample size was 50,000.
Conclusion: GAIN imputed missing data in large real-world clinical datasets more accurately than MICE and missForest, and was more robust to a high missingness rate (50%). Its high computation speed is an added advantage in big clinical data research. It holds potential as an accurate and efficient method for missing data imputation in future big data clinical research. Trial registration: ClinicalTrials.gov ID: NCT03299010; Unique Protocol ID: HKUCTR-2232
Persistent Identifier	http://hdl.handle.net/10722/313566
ISI Accession Number ID	WOS:000642555800002
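
The two accuracy measures named in the abstract's Methods can be sketched as follows. This is an illustrative sketch only, not the authors' code. It assumes the missForest-style convention, in which NRMSE is the root of the mean squared error divided by the variance of the true values, and PFC is the share of imputed categorical entries that differ from the truth, both computed only over the entries that were masked as missing. The function names `nrmse` and `pfc` and the `missing_mask` argument are hypothetical.

```python
import numpy as np


def nrmse(true_values, imputed_values, missing_mask):
    """NRMSE over masked entries (assumed definition: sqrt(MSE / variance of truth))."""
    true_values = np.asarray(true_values, dtype=float)
    imputed_values = np.asarray(imputed_values, dtype=float)
    mask = np.asarray(missing_mask, dtype=bool)
    # Only entries that were set to missing contribute to the error.
    diff = true_values[mask] - imputed_values[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(true_values[mask])))


def pfc(true_labels, imputed_labels, missing_mask):
    """Proportion of masked categorical entries whose imputed label is wrong."""
    true_labels = np.asarray(true_labels)
    imputed_labels = np.asarray(imputed_labels)
    mask = np.asarray(missing_mask, dtype=bool)
    return float(np.mean(true_labels[mask] != imputed_labels[mask]))


if __name__ == "__main__":
    # Toy data, made up for illustration; not taken from the study.
    mask = [False, True, True, False]
    print(nrmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 2.0, 4.0], mask))
    print(pfc(["a", "b", "a", "b"], ["a", "a", "a", "b"], mask))
```

For both measures, lower values indicate more accurate imputation. Other normalizations of RMSE (for example by the range or the standard deviation of each variable) are also in use, so the published definition should be checked before comparing numbers.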

DC Field	Value	Language
dc.contributor.author	DONG, W	-
dc.contributor.author	Fong, DYT	-
dc.contributor.author	Yoon, JS	-
dc.contributor.author	Wan, YFE	-
dc.contributor.author	Bedford, LE	-
dc.contributor.author	Tang, HM	-
dc.contributor.author	Lam, CLK	-
dc.date.accessioned	2022-06-17T06:48:18Z	-
dc.date.available	2022-06-17T06:48:18Z	-
dc.date.issued	2021	-
dc.identifier.citation	BMC Medical Research Methodology, 2021, v. 21	-
dc.identifier.uri	http://hdl.handle.net/10722/313566	-
dc.language	eng	-
dc.relation.ispartof	BMC Medical Research Methodology	-
dc.title	Generative adversarial networks for imputing missing data for big data clinical research	-
dc.type	Article	-
dc.identifier.email	Fong, DYT: dytfong@hku.hk	-
dc.identifier.email	Wan, YFE: yfwan@hku.hk	-
dc.identifier.email	Tang, HM: erichm@hku.hk	-
dc.identifier.email	Lam, CLK: clklam@hku.hk	-
dc.identifier.authority	Fong, DYT=rp00253	-
dc.identifier.authority	Wan, YFE=rp02518	-
dc.identifier.authority	Lam, CLK=rp00350	-
dc.identifier.doi	10.1186/s12874-021-01272-3	-
dc.identifier.hkuros	333730	-
dc.identifier.volume	21	-
dc.identifier.isi	WOS:000642555800002	-
