
Postgraduate thesis: Advancing model compression for resource-limited deep learning

Title: Advancing model compression for resource-limited deep learning
Authors: Tao, Chaofan (陶超凡)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Tao, C. [陶超凡]. (2024). Advancing model compression for resource-limited deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: This thesis explores the compression and acceleration of neural networks, a vital area in deep learning that bridges the gap between advanced performance and practical implementation. The goal of this research is to make deep learning models more accessible and efficient for real-world applications across different tasks, where computational resources and storage are often limited. The work is structured into four main parts, addressing model efficiency across different model architectures on vision and language tasks.

The first part of the thesis introduces a novel quantization pipeline that transforms network weights into the frequency domain to bridge the gap between full-precision and low-precision representations in convolutional neural networks (CNNs), the representative network architectures for computer vision tasks. This approach simplifies the quantization process by learning quantization-friendly representations, enabling simple quantizers to achieve good quantization results. The effectiveness of this Frequency-Aware Transformation (FAT) framework is demonstrated by its ability to quantize CNNs to low bit-widths, achieving state-of-the-art performance across various model architectures with no special hardware requirements for deployment.

The second part addresses the challenges of compressing generative Pre-trained Language Models (PLMs). We investigate why conventional quantization methods underperform in this context, particularly issues such as homogeneous word embeddings and varied weight distributions. To tackle these challenges, we propose a token-level contrastive distillation and a module-wise dynamic scaling, which respectively regularize the word embeddings to be inhomogeneous and make the quantizers adaptive to different modules. The proposed method is applied to language modeling, summarization, and dialogue-related tasks, yielding competitive compression rates without sacrificing much performance.

The third part looks into the structured pruning of generative PLMs. Based on the observation of persistent outliers, we argue that the information in the hidden dimension is redundant, which makes it possible to prune the hidden dimension without harming performance. A mask-based multi-dimension pruning method is then designed to identify and eliminate redundant sub-networks in the model, allowing flexible extraction of pruned PLMs of various sizes. The method's effectiveness is validated through extensive experiments on language modeling, summarization, and machine translation, obtaining up to a 25× compression rate in model size with a slight performance drop and up to a 9× improvement in inference speed.

The fourth part of the thesis studies the robustness of quantization, casting robust quantization as an Online Domain Generalization Quantization (ODG-Q) task. ODG-Q generates diverse adversarial data during training, enhancing the robustness of quantized networks against different types of attacks. It offers a robust training mechanism for quantized and binary neural networks, improving defense against white-box and black-box attacks on typical datasets while keeping training costs comparable to those of natural training.
In conclusion, this thesis has advanced the compression and acceleration of neural networks in vision and language tasks by introducing innovative techniques such as Frequency-Aware Transformation for CNN quantization, quantization and pruning methods for generative PLMs, and robust quantization strategies. These contributions not only enhance model efficiency and accessibility but also maintain competitive performance, paving the way for broader real-world applications of deep learning models under resource constraints.
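To make the abstract's terminology concrete, below is a minimal, self-contained sketch (in Python/NumPy) of the kind of "simple quantizer" the first part refers to: a symmetric, per-tensor uniform quantizer. It is an illustrative baseline under assumed settings (the function name, the 4-bit width, and the random weights are choices for this example), not the thesis's FAT, distillation-based, or ODG-Q method.

    import numpy as np

    def uniform_quantize(w, n_bits=4):
        # Symmetric per-tensor uniform quantization ("fake quant"):
        # map float weights onto signed n-bit integer levels and back.
        qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for signed 4-bit
        scale = np.max(np.abs(w)) / qmax                   # one scale for the tensor
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
        return q * scale                                   # dequantized weights

    # The gap between full-precision and low-precision weights that
    # quantization-friendly representations aim to shrink:
    w = np.random.randn(64, 64).astype(np.float32)
    print("mean |w - Q(w)|:", float(np.mean(np.abs(w - uniform_quantize(w)))))

The structured-pruning idea in the third part can likewise be sketched generically: score hidden dimensions by weight magnitude and keep only the top-scoring ones. Again, this is a hypothetical illustration of pruning the hidden dimension, not the thesis's mask-based multi-dimension method.

    # Keep the 50% of hidden dimensions with the largest total magnitude.
    W = np.random.randn(512, 512)        # e.g. a feed-forward weight matrix
    scores = np.abs(W).sum(axis=0)       # importance score per hidden dim
    keep = np.argsort(scores)[-256:]     # indices of the top half
    W_pruned = W[:, keep]                # smaller dense matrix, shape (512, 256)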
Degree: Doctor of Philosophy
Subjects: Neural networks (Computer science); Deep learning (Machine learning)
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/351049

DC Field | Value | Language
dc.contributor.author | Tao, Chaofan | -
dc.contributor.author | 陶超凡 | -
dc.date.accessioned | 2024-11-08T07:10:58Z | -
dc.date.available | 2024-11-08T07:10:58Z | -
dc.date.issued | 2024 | -
dc.identifier.citation | Tao, C. [陶超凡]. (2024). Advancing model compression for resource-limited deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/351049 | -
dc.description.abstract | (abstract as given above) | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Neural networks (Computer science) | -
dc.subject.lcsh | Deep learning (Machine learning) | -
dc.title | Advancing model compression for resource-limited deep learning | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Electrical and Electronic Engineering | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2024 | -
dc.identifier.mmsid | 991044869876403414 | -
