Postgraduate thesis: Deep generative representation learning

Title: Deep generative representation learning
Authors: Guo, Qiushan (郭秋杉)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Guo, Q. [郭秋杉]. (2024). Deep generative representation learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Images and natural language are ubiquitous in the real world. However, generating highly artistic images and documents grounded in world knowledge remains challenging, especially when images and text must be handled together as multimodal data. It is therefore important to develop efficient generative models that can handle multimodal data simultaneously. Moreover, the features learned by generative models exhibit good semantic properties; inspired by this, integrating discriminative and generative models into a single model is a meaningful goal. This thesis aims to improve generative models from three aspects: rethinking diffusion-based generative models from a data perspective, unifying discriminative and generative models through probabilistic modeling, and generating data in multimodal settings.

Current deep generative models are data-driven; however, their training recipes are largely handcrafted and must be adapted for new scenarios. From a data perspective, we therefore perform a comprehensive empirical analysis of diffusion-based generative models. Based on this investigation, we introduce a novel metric, the Weighted Signal-to-Noise Ratio (WSNR), to consistently quantify noise levels in both RGB and latent spaces. This metric enables us to establish WSNR-Equivalent training noise schedules, significantly enhancing the performance of high-resolution models in these domains. Additionally, we examine the reverse sampling process through an Ordinary Differential Equation (ODE) framework, shedding light on data-driven sampling strategies. Finally, we propose an adaptive scheme for choosing numerical methods within computational constraints, balancing efficacy and efficiency.

Recent generative models show that their internal representation spaces correlate with semantic concepts. Motivated by this, we propose to unify discriminative and generative models through probabilistic modeling. Specifically, we propose an energy-based classifier and generator, EGC, which achieves superior performance on both tasks with a single neural network. Unlike conventional classifiers, which produce a label given an image (i.e., a conditional distribution p(y|x)), the forward pass of EGC is a classification model that yields a joint distribution p(x,y); the backward pass acts as a diffusion model by marginalizing out the label y to estimate the score function. Furthermore, EGC can be adapted to unsupervised learning by treating the label as a latent variable. This work is the first to master both tasks with a single set of network parameters, and we believe EGC bridges the gap between discriminative and generative learning.

In real-world applications, most generative problems involve both images and text. Vision-language models (VLMs) have advanced rapidly through the integration of large language models, yet they struggle with detailed regional visual understanding because of limited spatial awareness and coarse-grained region-specific training data. To address this, we introduce RegionGPT (RGPT), a novel framework designed for complex multimodal region-level captioning and understanding. RGPT enhances the spatial awareness of regional representations with simple yet effective modifications to existing visual encoders in VLMs. We demonstrate that a universal RGPT model can be effectively applied to, and significantly improves performance on, a range of multimodal region-level tasks, including complex region description, reasoning, object classification, and referring-expression comprehension.
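For context on the noise-level discussion in the abstract, the signal-to-noise ratio of a standard diffusion forward process is recalled below. This is generic background only; the thesis's Weighted SNR (WSNR) is, as the name suggests, a weighted variant, and its exact weighting is not specified in this abstract.

```latex
% Background: standard diffusion forward process and its signal-to-noise
% ratio. The thesis's WSNR is a weighted variant whose precise weighting
% is not given in this abstract.
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\qquad \mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2}
```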
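The EGC description above (a forward pass yielding a joint distribution p(x,y) and a backward pass that marginalizes out the label y to estimate the score) follows the standard energy-based reading of classifier logits. The sketch below only illustrates that idea; `logits_fn` is a hypothetical stand-in for the network, and the actual EGC architecture, noise conditioning, and training objective are defined in the thesis.

```python
import torch

def score_from_joint_energy(logits_fn, x):
    """Sketch: estimate the score (gradient of log p(x) w.r.t. x) from
    classifier logits, under the energy-based reading used by
    joint-energy-style models: logits f(x)[y] define an unnormalized joint
    density p(x, y) proportional to exp(f(x)[y]). Marginalizing out y gives
    log p(x) = logsumexp_y f(x)[y] - log Z, and log Z drops out of the
    gradient with respect to x.
    """
    x = x.detach().requires_grad_(True)
    logits = logits_fn(x)                     # shape: (batch, num_classes)
    log_px = torch.logsumexp(logits, dim=-1)  # unnormalized log p(x), per sample
    return torch.autograd.grad(log_px.sum(), x)[0]
```

In a diffusion-style sampler, such a score estimate would drive the reverse step; how EGC does this precisely is described in the thesis itself.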
Degree: Doctor of Philosophy
Subject: Machine learning; Artificial intelligence
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/352675

 

DC Field: Value
dc.contributor.author: Guo, Qiushan
dc.contributor.author: 郭秋杉
dc.date.accessioned: 2024-12-19T09:27:10Z
dc.date.available: 2024-12-19T09:27:10Z
dc.date.issued: 2024
dc.identifier.citation: Guo, Q. [郭秋杉]. (2024). Deep generative representation learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/352675
dc.description.abstract: [abstract as above]
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Artificial intelligence
dc.title: Deep generative representation learning
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044891406203414
