Conference Paper: Comparison of Generative Adversarial Nets and Traditional Methods for Imputing Missing Data in Big Data Primary Care Research (Big Data)

Title: Comparison of Generative Adversarial Nets and Traditional Methods for Imputing Missing Data in Big Data Primary Care Research (Big Data)
Authors: Dong, W; Lam, CLK; Wan, YFE; Tang, HM; Wong, CKH
Issue Date: 2019
Publisher: North American Primary Care Research Group (NAPCRG)
Citation: 47th North American Primary Care Research Group (NAPCRG) Annual Meeting, Toronto, Ontario, Canada, 16-20 November 2019
Abstract
Context: Missing data is a pervasive problem that should be handled appropriately in clinical research. A novel deep learning-based method, the generative adversarial imputation network (GAIN), has been shown to impute missing values accurately. This new imputation tool has potential applications in large-scale clinical data analytic research.
Objective: This study aimed to compare the imputation performance of GAIN with that of two traditional methods, multiple imputation by chained equations (MICE) and missForest, in order to identify the most appropriate imputation method for primary care research. Performance was compared in terms of accuracy, precision and time efficiency across different types of variables, different proportions of missing data and different sample sizes.
Study Design: Simulation study.
Dataset: Data from a 10-year (Jan. 2008 to Dec. 2017) cohort study of a primary care population of patients with type 2 diabetes mellitus (N = 141,516).
Outcome Measures: Normalized root mean square error (NRMSE) and proportion of falsely classified entries (PFC) to assess single imputation accuracy. Coverage rate and 95% CI width of the parameter estimates to assess the predictive accuracy and precision of multiple imputation. Computation time per imputation to assess time efficiency.
Results: For single imputation accuracy, GAIN performed best on skewed continuous variables and imbalanced categorical variables, while missForest performed best on normally distributed continuous variables and balanced categorical variables. GAIN performed best when the missing rate was >30%. For multiple imputation, missForest displayed the highest coverage rates and GAIN showed the narrowest 95% CI. GAIN was the fastest method (31.33 min per imputation at 50,000 cases), ahead of both traditional methods (MICE: 51.00 min per imputation; missForest: 1140 min per imputation at 50,000 cases).
Conclusions: Researchers can select the most suitable imputation method for missing data in big clinical data analytic research by taking into account the data characteristics, missing proportion, data distribution, sample size and available computation time. Furthermore, more than one imputation method could be used to cross-validate the results and ensure reliability.
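The two single-imputation accuracy measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which normalization the authors used for NRMSE; the version below assumes the convention of dividing the mean squared error by the variance of the true values, and the function names `nrmse` and `pfc` and all input values are hypothetical, chosen only for illustration.

```python
import math

def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over the entries that were masked and then imputed.

    Assumption: normalization by the (population) variance of the true
    values; the abstract does not specify the authors' exact formula.
    """
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_true)

def pfc(true_labels, imputed_labels):
    """Proportion of masked categorical entries imputed with the wrong class."""
    wrong = sum(1 for t, i in zip(true_labels, imputed_labels) if t != i)
    return wrong / len(true_labels)

# Toy values, made up for illustration only (not study data).
continuous_error = nrmse([0.0, 2.0], [1.0, 1.0])
categorical_error = pfc(["a", "b", "a", "b"], ["a", "a", "a", "b"])
```

Under this definition, lower is better for both measures: a perfect imputation gives 0, and imputing every continuous entry with the mean of the true values gives an NRMSE of 1.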
Description: Session BD19: Oral Presentation On Completed Research
Persistent Identifier: http://hdl.handle.net/10722/290042

 

DC Field | Value | Language
dc.contributor.author | Dong, W | -
dc.contributor.author | Lam, CLK | -
dc.contributor.author | Wan, YFE | -
dc.contributor.author | Tang, HM | -
dc.contributor.author | Wong, CKH | -
dc.date.accessioned | 2020-10-22T08:21:12Z | -
dc.date.available | 2020-10-22T08:21:12Z | -
dc.date.issued | 2019 | -
dc.identifier.citation | 47th North American Primary Care Research Group (NAPCRG) Annual Meeting, Toronto, Ontario, Canada, 16-20 November 2019 | -
dc.identifier.uri | http://hdl.handle.net/10722/290042 | -
dc.description | Session BD19: Oral Presentation On Completed Research | -
dc.language | eng | -
dc.publisher | North American Primary Care Research Group (NAPCRG). | -
dc.relation.ispartof | 47th North American Primary Care Research Group (NAPCRG) Annual Meeting | -
dc.title | Comparison of Generative Adversarial Nets and Traditional Methods for Imputing Missing Data in Big Data Primary Care Research (Big Data) | -
dc.type | Conference_Paper | -
dc.identifier.email | Lam, CLK: clklam@hku.hk | -
dc.identifier.email | Wan, YFE: yfwan@hku.hk | -
dc.identifier.email | Tang, HM: erichm@hku.hk | -
dc.identifier.email | Wong, CKH: carlosho@hku.hk | -
dc.identifier.authority | Lam, CLK=rp00350 | -
dc.identifier.authority | Wan, YFE=rp02518 | -
dc.identifier.authority | Wong, CKH=rp01931 | -
dc.identifier.hkuros | 317346 | -
dc.publisher.place | Toronto, Canada | -
