postgraduate thesis: Advancing model compression for resource-limited deep learning
Field | Value |
---|---|
Title | Advancing model compression for resource-limited deep learning |
Authors | Tao, Chaofan (陶超凡) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Tao, C. [陶超凡]. (2024). Advancing model compression for resource-limited deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | This thesis explores the compression and acceleration of neural networks, a vital area in deep learning that bridges the gap between advanced performance and practical deployment. The goal of this research is to make deep learning models more accessible and efficient for real-world applications across different tasks, where computational resources and storage are often limited. The work is structured into four main parts, examining model efficiency across different model architectures on vision and language tasks.
The first part of the thesis introduces a novel quantization pipeline that transforms network weights into the frequency domain to bridge the gap between full-precision and low-precision representations in convolutional neural networks (CNNs), the representative network architectures for computer vision tasks. This approach simplifies quantization by learning quantization-friendly representations, enabling simple quantizers to achieve strong results. The effectiveness of this Frequency-Aware Transformation (FAT) framework is demonstrated through its ability to embed CNNs in low bit-widths, delivering state-of-the-art performance across various model architectures with no special hardware requirements for deployment (a minimal illustrative sketch of the frequency-domain idea appears after this table).
The second part addresses the challenges of compressing generative pre-trained language models (PLMs). We investigate why conventional quantization methods underperform in this context, particularly issues such as homogeneous word embeddings and varied weight distributions. To tackle these challenges, we propose a token-level contrastive distillation, which regularizes the word embeddings to be inhomogeneous, and a module-wise dynamic scaling, which makes quantizers adaptive to different modules (see the second sketch after this table). The proposed method is applied to language modeling, summarization, and dialogue tasks, achieving competitive compression rates without sacrificing much performance.
The third part looks into the structured pruning of generative PLMs. Based on the observation of persistent outliers, we argue that the information in the hidden dimension is redundant, which makes it possible to prune the hidden dimension without harming performance. A mask-based multi-dimension pruning method is then designed to identify and eliminate redundant sub-networks in the model, allowing pruned PLMs of various sizes to be extracted flexibly (see the pruning sketch after this table). The method's effectiveness is validated through extensive experiments on language modeling, summarization, and machine translation, achieving up to a 25× compression rate in model size with only a slight performance drop and up to 9× faster inference.
The fourth part of the thesis studies the robustness of quantization, casting robust quantization as an Online Domain Generalization Quantization (ODG-Q) task. This approach generates diverse adversarial data during training, enhancing the robustness of quantized networks against different types of attacks (a toy training-loop sketch appears after this table). ODG-Q offers a robust training mechanism for quantized and binary neural networks, improving defense against white-box and black-box attacks on typical datasets while maintaining training costs comparable to natural training.
In conclusion, this thesis advances the compression and acceleration of neural networks in vision and language tasks by introducing techniques such as Frequency-Aware Transformation for CNN quantization, compression methods for generative PLMs, and robust quantization strategies. These contributions not only enhance model efficiency and accessibility but also maintain competitive performance, paving the way for broader real-world applications of deep learning models under resource constraints. |
Degree | Doctor of Philosophy |
Subject | Neural networks (Computer science); Deep learning (Machine learning) |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/351049 |
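
The abstract's first part describes learning quantization-friendly weight representations in the frequency domain. Below is a minimal, hypothetical PyTorch sketch of that idea, not the thesis's exact FAT algorithm: the FFT-based transform, the learnable spectral mask, and the uniform quantizer with a straight-through estimator are all illustrative assumptions.

```python
import torch

def uniform_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric uniform quantizer with a straight-through estimator (STE)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()  # forward uses w_q; gradients flow to w

def fat_quantize(w: torch.Tensor, mask: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Move weights to the frequency domain, reshape the spectrum with a
    learnable mask into a quantization-friendly form, transform back, and
    apply a simple quantizer."""
    spectrum = torch.fft.fft2(w)               # weights in the frequency domain
    shaped = spectrum * torch.sigmoid(mask)    # learned spectral reshaping (hypothetical)
    w_friendly = torch.fft.ifft2(shaped).real  # back to the spatial domain
    return uniform_quantize(w_friendly, bits)

# Toy usage on a bank of 3x3 convolution kernels; the mask would be trained
# jointly with the network weights.
w = torch.randn(16, 3, 3, 3)
mask = torch.nn.Parameter(torch.zeros_like(w))
w_q = fat_quantize(w, mask, bits=4)
```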
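For the second part, a hedged sketch of its two ingredients as summarized above: a token-level contrastive loss that keeps token embeddings distinguishable under quantization, and a quantizer with a learnable, module-wise scale. The InfoNCE form, temperature, and class names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(student: torch.Tensor, teacher: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over tokens: each student token representation (num_tokens,
    hidden_dim) is pulled toward its own teacher token and pushed away from
    the others, discouraging homogeneous word embeddings."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / temperature                     # all student-teacher pair similarities
    labels = torch.arange(s.size(0), device=s.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)

class ModuleWiseQuantizer(torch.nn.Module):
    """Uniform quantizer with a learnable clipping scale; one instance per
    module lets modules with different weight distributions adapt their grids."""
    def __init__(self, bits: int = 4):
        super().__init__()
        self.bits = bits
        self.log_scale = torch.nn.Parameter(torch.zeros(()))  # dynamic, per-module

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        qmax = 2 ** (self.bits - 1) - 1
        step = self.log_scale.exp()
        w_q = torch.clamp(torch.round(w / step), -qmax, qmax) * step
        return w + (w_q - w).detach()  # straight-through estimator
```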
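For the third part, an illustrative sketch of mask-based structured pruning (again, not the thesis's exact multi-dimension method): a learnable gate per hidden unit stays soft during training, is pushed toward sparsity by a penalty term, and is finally thresholded to slice out a smaller dense layer. Varying the threshold or penalty weight would yield pruned models of different sizes.

```python
import torch

class MaskedLinear(torch.nn.Module):
    """Linear layer whose output units are gated by learnable masks."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)
        self.gate_logits = torch.nn.Parameter(torch.zeros(out_dim))  # one gate per unit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * torch.sigmoid(self.gate_logits)  # soft mask while training

    def sparsity_penalty(self) -> torch.Tensor:
        return torch.sigmoid(self.gate_logits).sum()  # added to the loss to push gates to 0

    def extract(self, threshold: float = 0.5) -> torch.nn.Linear:
        """Materialize the pruned sub-network by keeping surviving units only."""
        keep = torch.sigmoid(self.gate_logits) > threshold
        pruned = torch.nn.Linear(self.linear.in_features, int(keep.sum()))
        pruned.weight.data = self.linear.weight.data[keep]
        pruned.bias.data = self.linear.bias.data[keep]
        return pruned
```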
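Finally, a toy sketch of the training-loop idea behind the fourth part: adversarial inputs generated on the fly act as extra "domains", and the quantized network is trained on clean and adversarial batches together. The one-step FGSM attack and the simple loss sum are stand-ins; the actual method generates more diverse attacks.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps: float = 8 / 255) -> torch.Tensor:
    """One-step attack, standing in for a pool of diverse attack types.
    Assumes inputs are images scaled to [0, 1]."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def odgq_step(model, optimizer, x, y) -> float:
    """One step over the clean domain plus a freshly generated adversarial domain."""
    x_adv = fgsm(model, x, y)  # online domain generation
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since each batch adds roughly one extra forward/backward pass, training cost stays within a small constant factor of natural training, consistent with the claim above.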
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Tao, Chaofan | - |
dc.contributor.author | 陶超凡 | - |
dc.date.accessioned | 2024-11-08T07:10:58Z | - |
dc.date.available | 2024-11-08T07:10:58Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Tao, C. [陶超凡]. (2024). Advancing model compression for resource-limited deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/351049 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Neural networks (Computer science) | - |
dc.subject.lcsh | Deep learning (Machine learning) | - |
dc.title | Advancing model compression for resource-limited deep learning | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044869876403414 | - |