
Postgraduate thesis: Advancing model compression for resource-limited deep learning

Title: Advancing model compression for resource-limited deep learning
Authors: Tao, Chaofan (陶超凡)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Tao, C. [陶超凡]. (2024). Advancing model compression for resource-limited deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: This thesis explores the compression and acceleration of neural networks, a vital area in deep learning that bridges the gap between advanced performance and practical implementation. The goal of this research is to make deep learning models more accessible and efficient for real-world applications across different tasks, where computational resources and storage are often limited. The work is structured into four main parts, addressing model efficiency across different model architectures on vision and language tasks.

The first part of the thesis introduces a novel quantization pipeline that transforms network weights into the frequency domain to bridge the gap between full-precision and low-precision representations in convolutional neural networks (CNNs), the representative network architectures for computer vision tasks. This approach simplifies the quantization process by learning quantization-friendly representations, enabling simple quantizers to achieve good quantization results. The effectiveness of this Frequency-Aware Transformation (FAT) framework is demonstrated by its ability to quantize CNNs to low bit-widths, achieving state-of-the-art performance across various model architectures with no special hardware requirements for deployment.

The second part addresses the challenges of compressing generative Pre-trained Language Models (PLMs). We investigate why conventional quantization methods underperform in this context, particularly issues such as homogeneous word embeddings and varied weight distributions. To tackle these challenges, we propose a token-level contrastive distillation and a module-wise dynamic scaling, which respectively regularize the word embeddings to be inhomogeneous and make the quantizers adaptive to different modules. The proposed method is applied to language modeling, summarization, and dialogue-related tasks, yielding competitive compression rates without sacrificing much performance.

The third part looks into the structured pruning of generative PLMs. Based on the observation of persistent outliers, we argue that the information in the hidden dimension is redundant, which makes it possible to prune the hidden dimension without harming performance. A mask-based multi-dimension pruning method is then designed to identify and eliminate redundant sub-networks in the model, allowing flexible extraction of pruned PLMs of various sizes. The method's effectiveness is validated through extensive experiments on language modeling, summarization, and machine translation, obtaining up to a 25× compression rate in model size with a slight performance drop and up to a 9× improvement in inference speed.

The fourth part of the thesis studies the robustness of quantization, casting robust quantization as an Online Domain Generalization Quantization (ODG-Q) task. ODG-Q generates diverse adversarial data during training, enhancing the robustness of quantized networks against different types of attacks. It offers a robust training mechanism for quantized and binary neural networks, improving defense against white-box and black-box attacks on typical datasets while keeping training costs comparable to those of natural training.
In conclusion, this thesis has advanced the compression and acceleration of neural networks in vision and language tasks by introducing innovative techniques such as Frequency-Aware Transformation for CNN quantization, quantization and pruning methods for generative PLMs, and robust quantization strategies. These contributions not only enhance model efficiency and accessibility but also maintain competitive performance, paving the way for broader real-world applications of deep learning models under resource constraints.
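To make the abstract's terminology concrete, below is a minimal, self-contained sketch (in Python/NumPy) of the kind of "simple quantizer" the first part refers to: a symmetric, per-tensor uniform quantizer. It is an illustrative baseline under assumed settings (the function name, the 4-bit width, and the random weights are choices for this example), not the thesis's FAT, distillation-based, or ODG-Q method.

    import numpy as np

    def uniform_quantize(w, n_bits=4):
        # Symmetric per-tensor uniform quantization ("fake quant"):
        # map float weights onto signed n-bit integer levels and back.
        qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for signed 4-bit
        scale = np.max(np.abs(w)) / qmax                   # one scale for the tensor
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
        return q * scale                                   # dequantized weights

    # The gap between full-precision and low-precision weights that
    # quantization-friendly representations aim to shrink:
    w = np.random.randn(64, 64).astype(np.float32)
    print("mean |w - Q(w)|:", float(np.mean(np.abs(w - uniform_quantize(w)))))

The structured-pruning idea in the third part can likewise be sketched generically: score hidden dimensions by weight magnitude and keep only the top-scoring ones. Again, this is a hypothetical illustration of pruning the hidden dimension, not the thesis's mask-based multi-dimension method.

    # Keep the 50% of hidden dimensions with the largest total magnitude.
    W = np.random.randn(512, 512)        # e.g. a feed-forward weight matrix
    scores = np.abs(W).sum(axis=0)       # importance score per hidden dim
    keep = np.argsort(scores)[-256:]     # indices of the top half
    W_pruned = W[:, keep]                # smaller dense matrix, shape (512, 256)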
Degree: Doctor of Philosophy
Subjects: Neural networks (Computer science); Deep learning (Machine learning)
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/351049

DC Field | Value | Language
dc.contributor.author | Tao, Chaofan | -
dc.contributor.author | 陶超凡 | -
dc.date.accessioned | 2024-11-08T07:10:58Z | -
dc.date.available | 2024-11-08T07:10:58Z | -
dc.date.issued | 2024 | -
dc.identifier.citation | Tao, C. [陶超凡]. (2024). Advancing model compression for resource-limited deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/351049 | -
dc.description.abstract | (abstract as given above) | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Neural networks (Computer science) | -
dc.subject.lcsh | Deep learning (Machine learning) | -
dc.title | Advancing model compression for resource-limited deep learning | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Electrical and Electronic Engineering | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2024 | -
dc.identifier.mmsid | 991044869876403414 | -
