AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations

Yu, Yangwen; 于洋文

Appears in Collections:
- HKU Theses Online
- Electrical & Electronic Engineering: Theses

Title	AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations
Authors	Yu, Yangwen 于洋文
Advisors	Li, VOK; Lam, JCK
Issue Date	2022
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Yu, Y. [于洋文]. (2022). AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Data loss, due to equipment and transmission failures, presents a key challenge to air quality monitoring. Since existing data imputation methods fail to capture the specific characteristics of missing air pollution data, the effective recovery of this data remains an unresolved and critical challenge. Accordingly, this study aims to analyse these characteristics and propose corresponding effective recovery methods. First, we investigate the low-rank property of short-term air pollution data and transform the missing-data recovery problem into a Low-Rank Matrix Completion (LRMC) problem. Based on duality theory, the formulated LRMC problem is further converted into a sub-gradient primal-dual problem, and a Singular Value Thresholding (SVT) algorithm is developed to solve it. Second, we analyse the spatial-temporal correlation and periodicity of long-term data, then propose a corresponding Long-Short Term Context Encoder (LSCE). As a variant of Generative Adversarial Networks (GANs) composed of Convolutional Neural Network (CNN) layers, our LSCE offers the following novelties: (1) Data pre-processing converts the air pollution measurements into weekly image-like matrices to capture the similarity and periodicity of air pollutants; (2) CNN layers mine the spatial-temporal correlation from the incomplete data inputs to reconstruct a complete dataset; and (3) the GAN structure combines the generator with the discriminator and uses a joint loss function to improve recovery accuracy. Third, we study the weekday/weekend and seasonal variation of long-term air pollution data and identify their local and non-local components.
In response, we propose an Improved Long-Short Term Context Encoder (ILSCE) model, which benefits from three novelties: (1) A new CNN update mechanism, which allows the ILSCE to hierarchically recover data based on the non-empty inputs; (2) Periodicity labels of the inputs that allow the model to extract domain-specific features; and (3) Expert-led knowledge injected into the loss function to help the ILSCE capture both the local emissions and non-local background pollutants. In addition, building on LSCE and ILSCE, we propose GCN-ST-MDR, a novel Graph Convolutional Network (GCN)-based framework that identifies the daily missing patterns and automatically selects the best recovery method. GCN-ST-MDR presents three novelties: (1) New graph construction transforms the missing mask into a spatial-temporal (S-T) graph based on the similarity matrix to improve the extraction of GCN data representation for pattern identification; (2) Transfer Learning (TL) rapidly pre-trains the LSCE and ILSCE models; and (3) the GCN structure outputs a selection indicator to determine the dominant missing pattern for each data matrix input. The accuracy of the pre-trained data recovery models is subsequently incorporated into the loss function of the GCN component to penalise incorrect indicators. In summary, SVT exploits the low-rank property to recover short-term data, LSCE and ILSCE use the inherent nature of air pollution data to recover long-term data, and GCN-ST-MDR automatically assigns the better of the two models to each missing pattern. Our results demonstrate that, regardless of data size and type, our data recovery framework can comprehensively and effectively recover missing air pollution data with higher accuracy than state-of-the-art data recovery models.
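Illustrative note (not part of the thesis record): the abstract's first contribution treats short-term recovery as low-rank matrix completion solved by singular value thresholding. The sketch below shows the generic SVT iteration (soft-threshold the singular values, then take a dual ascent step on the observed entries) in Python with NumPy. It is an assumption-laden illustration, not the author's implementation: the function name `svt_complete`, the default threshold `tau`, the step size `delta`, and the synthetic hours-by-stations data are all hypothetical choices made for this example.

```python
import numpy as np


def svt_complete(M, mask, tau=None, delta=1.2, tol=1e-4, max_iter=2000):
    """Fill missing entries of M (mask == False) by singular value thresholding.

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    M = np.asarray(M, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if tau is None:
        # Common heuristic; assumes entries are roughly unit scale.
        tau = 5.0 * np.sqrt(M.shape[0] * M.shape[1])
    observed = np.where(mask, M, 0.0)  # NaNs at missing entries are ignored
    obs_norm = np.linalg.norm(observed)
    Y = np.zeros_like(M)  # dual variable, non-zero only on observed entries
    X = np.zeros_like(M)  # current low-rank estimate
    for _ in range(max_iter):
        # Primal step: soft-threshold the singular values of Y.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt
        # Residual on observed entries only.
        residual = np.where(mask, observed - X, 0.0)
        if np.linalg.norm(residual) <= tol * obs_norm:
            break
        # Dual step: ascend along the observed-entry residual.
        Y += delta * residual
    return X


# Synthetic rank-1 example: 48 hourly readings at 12 stations (made-up data).
hours = np.arange(48)
diurnal = 1.0 + 0.5 * np.sin(2 * np.pi * hours / 24)
station_level = np.linspace(0.5, 1.5, 12)
truth = np.outer(diurnal, station_level)

rng = np.random.default_rng(0)
mask = rng.random(truth.shape) < 0.8  # about 80% of readings observed
readings = np.where(mask, truth, np.nan)
recovered = svt_complete(readings, mask)
```

Real monitoring data would normally be rescaled before applying a fixed `tau`, since the threshold trades off fidelity to the observed readings against the rank of the estimate.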
Degree	Doctor of Philosophy
Subject	Air quality monitoring stations; Missing observations (Statistics)
Dept/Program	Electrical and Electronic Engineering
Persistent Identifier	http://hdl.handle.net/10722/318425

DC Field	Value	Language
dc.contributor.advisor	Li, VOK	-
dc.contributor.advisor	Lam, JCK	-
dc.contributor.author	Yu, Yangwen	-
dc.contributor.author	于洋文	-
dc.date.accessioned	2022-10-10T08:18:57Z	-
dc.date.available	2022-10-10T08:18:57Z	-
dc.date.issued	2022	-
dc.identifier.citation	Yu, Y. [于洋文]. (2022). AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/318425	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Air quality monitoring stations	-
dc.subject.lcsh	Missing observations (Statistics)	-
dc.title	AI-driven identification and recovery of missing spatial-temporal data : a case study of missing data from air pollution monitoring stations	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Electrical and Electronic Engineering	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2022	-
dc.identifier.mmsid	991044600202803414	-
