Postgraduate thesis: Deep generative representation learning

Title: Deep generative representation learning
Authors: Guo, Qiushan (郭秋杉)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Guo, Q. [郭秋杉]. (2024). Deep generative representation learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Images and natural language are ubiquitous in the real world. However, generating highly artistic images and documents grounded in world knowledge remains challenging, especially when images and text must be handled together as multimodal data. It is therefore important to develop efficient generative models that can handle multimodal data simultaneously. Moreover, the features learned by generative models exhibit good semantic properties; inspired by this, integrating discriminative and generative models into a single model is a meaningful goal. This thesis aims to improve generative models from three aspects: rethinking diffusion-based generative models from a data perspective, unifying discriminative and generative models through probabilistic modeling, and generating data in multimodal settings.

Current deep generative models are data-driven; however, their training recipes are largely handcrafted and must be adapted for new scenarios. From a data perspective, we therefore perform a comprehensive empirical analysis of diffusion-based generative models. Based on this investigation, we introduce a novel metric, the Weighted Signal-to-Noise Ratio (WSNR), to consistently quantify noise levels in both RGB and latent spaces. This metric enables us to establish WSNR-Equivalent training noise schedules, significantly enhancing the performance of high-resolution models in these domains. Additionally, we examine the reverse sampling process through an Ordinary Differential Equation (ODE) framework, shedding light on data-driven sampling strategies. Finally, we propose an adaptive scheme for choosing numerical methods within computational constraints, balancing efficacy and efficiency.

Recent generative models show that their internal representation spaces correlate with semantic concepts. Motivated by this, we propose to unify discriminative and generative models through probabilistic modeling. Specifically, we propose an energy-based classifier and generator, EGC, which achieves superior performance on both tasks with a single neural network. Unlike conventional classifiers, which produce a label given an image (i.e., a conditional distribution p(y|x)), the forward pass of EGC is a classification model that yields a joint distribution p(x,y); the backward pass acts as a diffusion model by marginalizing out the label y to estimate the score function. Furthermore, EGC can be adapted to unsupervised learning by treating the label as a latent variable. This work is the first to master both tasks with a single set of network parameters, and we believe EGC bridges the gap between discriminative and generative learning.

In real-world applications, most generative problems involve both images and text. Vision-language models (VLMs) have advanced rapidly through the integration of large language models, yet they struggle with detailed regional visual understanding because of limited spatial awareness and coarse-grained region-specific training data. To address this, we introduce RegionGPT (RGPT), a novel framework designed for complex multimodal region-level captioning and understanding. RGPT enhances the spatial awareness of regional representations with simple yet effective modifications to existing visual encoders in VLMs. We demonstrate that a universal RGPT model can be effectively applied to, and significantly improves performance on, a range of multimodal region-level tasks, including complex region description, reasoning, object classification, and referring-expression comprehension.
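For context on the noise-level discussion in the abstract, the signal-to-noise ratio of a standard diffusion forward process is recalled below. This is generic background only; the thesis's Weighted SNR (WSNR) is, as the name suggests, a weighted variant, and its exact weighting is not specified in this abstract.

```latex
% Background: standard diffusion forward process and its signal-to-noise
% ratio. The thesis's WSNR is a weighted variant whose precise weighting
% is not given in this abstract.
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\qquad \mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2}
```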
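The EGC description above (a forward pass yielding a joint distribution p(x,y) and a backward pass that marginalizes out the label y to estimate the score) follows the standard energy-based reading of classifier logits. The sketch below only illustrates that idea; `logits_fn` is a hypothetical stand-in for the network, and the actual EGC architecture, noise conditioning, and training objective are defined in the thesis.

```python
import torch

def score_from_joint_energy(logits_fn, x):
    """Sketch: estimate the score (gradient of log p(x) w.r.t. x) from
    classifier logits, under the energy-based reading used by
    joint-energy-style models: logits f(x)[y] define an unnormalized joint
    density p(x, y) proportional to exp(f(x)[y]). Marginalizing out y gives
    log p(x) = logsumexp_y f(x)[y] - log Z, and log Z drops out of the
    gradient with respect to x.
    """
    x = x.detach().requires_grad_(True)
    logits = logits_fn(x)                     # shape: (batch, num_classes)
    log_px = torch.logsumexp(logits, dim=-1)  # unnormalized log p(x), per sample
    return torch.autograd.grad(log_px.sum(), x)[0]
```

In a diffusion-style sampler, such a score estimate would drive the reverse step; how EGC does this precisely is described in the thesis itself.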
Degree: Doctor of Philosophy
Subject: Machine learning; Artificial intelligence
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/352675

 

DC Field: Value
dc.contributor.author: Guo, Qiushan
dc.contributor.author: 郭秋杉
dc.date.accessioned: 2024-12-19T09:27:10Z
dc.date.available: 2024-12-19T09:27:10Z
dc.date.issued: 2024
dc.identifier.citation: Guo, Q. [郭秋杉]. (2024). Deep generative representation learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/352675
dc.description.abstract: [abstract as above]
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Artificial intelligence
dc.title: Deep generative representation learning
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044891406203414
