
postgraduate thesis: AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations

Title: AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations
Authors: Yu, Yangwen (于洋文)
Advisor(s): Li, VOK; Lam, JCK
Issue Date: 2022
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yu, Y. [于洋文]. (2022). AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Data loss, due to equipment and transmission failures, presents a key challenge to air quality monitoring. Since existing data imputation methods fail to capture the specific characteristics of missing air pollution data, the effective recovery of this data remains an unresolved and critical challenge. Accordingly, this study aims to analyse these characteristics and propose corresponding effective recovery methods. First, we investigate the low-rank property of short-term air pollution data and transform the missing recovery problem into a Low-Rank Matrix Completion (LRMC) problem. Based on duality theory, the formulated LRMC problem is further converted into a sub-gradient primal-dual problem, while Singular Value Thresholding (SVT) is developed to address this problem. Second, we analyse the temporal-spatial correlation and periodicity of long-term data, then propose a corresponding Long-Short Term Context Encoder (LSCE). As a variant of Generative Adversarial Networks (GANs) composed of Convolutional Neural Network (CNN) layers, our LSCE possesses the following novelties: (1) Data pre-processing converts the air pollution measurements into weekly image-like matrices to capture the similarity and periodicity of air pollutants; (2) CNN layers mine the temporal-spatial correlation from the incomplete data inputs to reconstruct a complete dataset; and (3) the GAN structure combines the generator with the discriminator, while using the joint loss function to improve recovery accuracy. Third, we study the weekday/weekend and seasonal variation of long-term air pollution data, while locating their local and non-local components.
In response, we propose an Improved Long-Short Term Context Encoder (ILSCE) model, which benefits from three novelties: (1) A new CNN update mechanism, which allows the ILSCE to hierarchically recover data based on the non-empty inputs; (2) Periodicity labels of the inputs that allow the model to extract domain-specific features; and (3) Expert-led knowledge injected into the loss function to help the ILSCE capture both the local emissions and non-local background pollutants. In addition, based on LSCE and ILSCE, GCN-ST-MDR, a novel Graph Convolutional Network (GCN)-based framework, is proposed to identify the daily missing patterns and automatically select the best recovery method. GCN-ST-MDR presents three novelties: (1) New graph construction transforms the missing mask into a spatial-temporal (S-T) graph based on the similarity matrix to improve the extraction of GCN data representation for pattern identification; (2) Transfer Learning (TL) rapidly pre-trains the LSCE and ILSCE models; and (3) the GCN structure outputs a selection indicator to determine the dominating missing pattern for each data matrix input. The accuracy of the pre-trained data recovery models is subsequently incorporated into the loss function of the GCN component to penalise the incorrect indicator. SVT employs the low-rank property to recover short-term data. LSCE and ILSCE use the inherent nature of air pollution data to recover long-term data, while GCN-ST-MDR is developed to automatically assign the best of the two models to recover any missing patterns. Our results demonstrate that, regardless of data size and type, our data recovery framework can comprehensively and effectively recover missing air pollution data with higher accuracy, compared to state-of-the-art data recovery models.
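The abstract's first step, recovering short-term data by Low-Rank Matrix Completion with Singular Value Thresholding, can be illustrated with a minimal sketch. This is a generic textbook-style SVT iteration, not the thesis's implementation: the function name `svt_complete`, the default threshold `tau`, the step size `delta`, and the stopping rule are all assumptions chosen for illustration, and the input is assumed to be a 2-D array (for example, hours by stations) with a boolean mask marking observed readings.

```python
import numpy as np


def svt_complete(M, mask, tau=None, delta=1.2, n_iters=500, tol=1e-4):
    """Fill missing entries of M with a generic singular value thresholding loop.

    M     : 2-D array; values where mask is False are ignored (may be NaN).
    mask  : boolean array of the same shape, True where a reading was observed.
    tau   : shrinkage threshold (assumed default: 5 * sqrt(rows * cols)).
    delta : step size (assumed default 1.2; kept below 2 for stability).
    """
    mask = np.asarray(mask, dtype=bool)
    # Zero out unobserved entries so NaNs never enter the SVD.
    M_obs = np.where(mask, np.asarray(M, dtype=float), 0.0)
    if tau is None:
        tau = 5.0 * np.sqrt(M_obs.shape[0] * M_obs.shape[1])

    Y = np.zeros_like(M_obs)  # dual variable
    X = np.zeros_like(M_obs)  # current low-rank estimate
    norm_obs = np.linalg.norm(M_obs)

    for _ in range(n_iters):
        # Soft-threshold the singular values of Y by tau.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt
        # Residual restricted to the observed entries.
        residual = np.where(mask, M_obs - X, 0.0)
        if np.linalg.norm(residual) <= tol * max(norm_obs, 1e-12):
            break
        # Dual ascent step on the observed entries.
        Y = Y + delta * residual
    return X


# Hypothetical usage on synthetic rank-2 data with about 20% of entries hidden.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
observed = rng.random(truth.shape) < 0.8
filled = svt_complete(np.where(observed, truth, np.nan), observed)
```

The sketch relies only on the low-rank assumption the abstract attributes to short-term data; it does not model the periodicity or spatial-temporal structure that the LSCE and ILSCE models are described as exploiting.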
Degree: Doctor of Philosophy
Subject: Air quality monitoring stations
Missing observations (Statistics)
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/318425

 

DC Field | Value | Language
dc.contributor.advisor | Li, VOK | -
dc.contributor.advisor | Lam, JCK | -
dc.contributor.author | Yu, Yangwen | -
dc.contributor.author | 于洋文 | -
dc.date.accessioned | 2022-10-10T08:18:57Z | -
dc.date.available | 2022-10-10T08:18:57Z | -
dc.date.issued | 2022 | -
dc.identifier.citation | Yu, Y. [于洋文]. (2022). AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/318425 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Air quality monitoring stations | -
dc.subject.lcsh | Missing observations (Statistics) | -
dc.title | AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Electrical and Electronic Engineering | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2022 | -
dc.identifier.mmsid | 991044600202803414 | -
