postgraduate thesis: Novel compression techniques for compact deep neural network design
Title | Novel compression techniques for compact deep neural network design |
---|---|
Authors | Lin, Rui (林睿) |
Advisors | Chesi, G; Wong, N |
Issue Date | 2022 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Lin, R. [林睿]. (2022). Novel compression techniques for compact deep neural network design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Deep neural networks (DNNs) have achieved remarkable breakthroughs in various disciplines, such as classification and object detection. Although deeper structures and more trainable parameters have successfully boosted the performance of DNNs, they inevitably pose stringent challenges to deploying modern DNNs on edge devices with constrained hardware resources. This dilemma motivates research on DNN compression to obtain compact models that require little storage and achieve fast inference without sacrificing much accuracy. Existing compression approaches mainly fall into three categories: 1) low-rank decomposition, 2) pruning, and 3) quantization. This thesis explores these popular techniques and investigates another promising but under-explored direction, namely sparse linear transforms.
Low-rank decomposition methods treat fully connected and convolutional layers as tensors, aiming to replace them with low-rank factors (viz., a sequence of smaller layers). However, existing techniques in this category invariably adopt a 4-way view of the weight tensor, which impedes further compression. This thesis identifies this unexploited room and proposes a method that further tensorizes the input-channel axis into smaller modes. Smaller kernels and higher compression ratios can then be obtained by decomposing the newly generated higher-order tensor (a toy sketch of this tensorize-then-decompose step appears after the table below).
Pruning has two sub-classes: weight pruning and filter pruning. Whereas weight pruning removes small weights in the kernel tensor, filter pruning eliminates entire filters, leading to structured sparsity and generic speedup irrespective of the software/hardware. Notably, most existing pruning schemes operate in the spatial domain, and the frequency domain remains comparatively unexplored. This thesis therefore connects a previously mysterious rank-based metric in the spatial domain to a novel, analytical view in the frequency domain. Along this route, an efficient Fast Fourier Transform (FFT)-based energy-zone metric is proposed to evaluate filter importance from a spectral perspective (see the second sketch below the table).
Quantization approaches aim to retain high accuracy with low-precision weights/activations, thus reducing memory footprint and computation. Existing methods have developed complicated quantization strategies, e.g., mixed precision and adaptive quantization levels, to achieve this goal. However, they have potential problems: 1) pushing full-precision values directly to their quantized representations can be suboptimal, 2) quantizing weights independently loses their correlations, and 3) approximated gradients can be inaccurate. This thesis therefore proposes a novel pipeline that removes redundant information before quantization by considering weight correlations in the frequency domain. Moreover, the pipeline makes the gradients explicit, so even simple uniform quantizers achieve impressive results when plugged into it (the third sketch below illustrates the idea).
Compared with the above categories, sparse and structured matrix factorization constitutes a new yet under-explored compression strategy. The few existing works in this category all restrict the shapes of the weight matrices that can be factorized. Moreover, they aim to replace only one or a few of the largest layers, flattened in the GEMM setting, which may not yield significant compression. This thesis therefore introduces a new sparse linear transform that generalizes conventional butterfly matrices and can be adapted to variable input-output dimensions (the final sketch below shows the conventional structure being generalized). The new framework inherits the fine-to-coarse-grained learnable hierarchy of traditional butterflies, yielding more lightweight networks without compromising accuracy. |
Degree | Doctor of Philosophy |
Subject | Neural networks (Computer science) |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/322890 |
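The tensorize-then-decompose idea from the abstract can be illustrated in a few lines. The sketch below is a minimal numpy illustration, not the thesis's actual algorithm: it assumes a Tucker-style truncated higher-order SVD (HOSVD) as the decomposition, and the kernel shape, channel split, and ranks are made-up examples.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated HOSVD: one factor matrix per mode, plus a small core."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for mode, U in enumerate(factors):  # project each mode onto its subspace
        core = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# A toy conv kernel: 64 output channels, 32 input channels, 3x3 spatial.
W = np.random.randn(64, 32, 3, 3)

# Tensorize the input-channel axis (32 = 4 * 8) into two smaller modes,
# turning the 4-way kernel into a 5-way tensor before decomposing it.
W5 = W.reshape(64, 4, 8, 3, 3)
core, factors = hosvd(W5, ranks=(16, 3, 4, 3, 3))
compressed = core.size + sum(U.size for U in factors)
print(f"original: {W.size} parameters, factored: {compressed} parameters")
```

The extra modes shrink each factor matrix, which is why the higher-order view admits higher compression ratios than the plain 4-way decomposition.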
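The FFT-based filter scoring can likewise be sketched. The zone partition below (a small low-frequency corner versus everything else) and the 50% keep ratio are placeholder assumptions; the thesis's actual energy-zone definition is not reproduced here.

```python
import numpy as np

def energy_zone_scores(W, zone=2):
    """Score each filter by the share of its spectral energy outside a
    small low-frequency zone (a stand-in for an energy-zone metric)."""
    spec = np.fft.fft2(W)                      # 2-D FFT over each kernel's spatial axes
    energy = np.abs(spec) ** 2                 # (C_out, C_in, kh, kw)
    total = energy.sum(axis=(1, 2, 3))
    low = energy[:, :, :zone, :zone].sum(axis=(1, 2, 3))
    return (total - low) / (total + 1e-12)     # per-filter high-frequency share

W = np.random.randn(64, 32, 3, 3)              # (C_out, C_in, kh, kw)
scores = energy_zone_scores(W)
keep = np.sort(np.argsort(scores)[len(scores) // 2:])   # prune the bottom half
print("filters kept:", keep)
```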
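For the quantization pipeline, the sketch below captures only the high-level recipe stated in the abstract: transform to a frequency domain, drop redundant coefficients, then apply a plain uniform quantizer. The 1-D real FFT, magnitude thresholding, keep ratio, and 4-bit width are all assumptions standing in for the thesis's actual design.

```python
import numpy as np

def quantize_uniform(x, n_bits=4):
    """Plain symmetric uniform quantizer."""
    levels = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / levels + 1e-12
    return np.round(x / scale).clip(-levels, levels) * scale

def freq_prune_then_quantize(W, keep_ratio=0.5, n_bits=4):
    """Remove small frequency coefficients, then uniformly quantize the rest."""
    flat = W.reshape(W.shape[0], -1)
    spec = np.fft.rfft(flat, axis=1)           # per-filter 1-D spectrum
    mag = np.abs(spec)
    spec[mag < np.quantile(mag, 1 - keep_ratio)] = 0   # drop redundant coefficients
    spec = quantize_uniform(spec.real, n_bits) + 1j * quantize_uniform(spec.imag, n_bits)
    return np.fft.irfft(spec, n=flat.shape[1], axis=1).reshape(W.shape)

W = np.random.randn(64, 32, 3, 3)
W_hat = freq_prune_then_quantize(W)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```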
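Finally, the butterfly structure the thesis generalizes: a conventional n-by-n butterfly (n a power of two) factors a dense matrix into log2(n) block-sparse factors with two nonzeros per row, which is the parameter saving the abstract alludes to. The sketch constructs only this standard square version, not the thesis's variable-dimension generalization.

```python
import numpy as np

def butterfly_factors(n, rng=np.random.default_rng(0)):
    """log2(n) butterfly factors for an n x n transform (n a power of two).
    Each factor has two nonzeros per row, so the product stores
    2 * n * log2(n) parameters instead of n**2."""
    factors, stride = [], n // 2
    while stride >= 1:
        F = np.zeros((n, n))
        for i in range(n):
            j = i ^ stride                     # partner index at this level
            F[i, i], F[i, j] = rng.standard_normal(2)
        factors.append(F)
        stride //= 2
    return factors

n = 16
factors = butterfly_factors(n)
x = np.random.randn(n)
y = x
for F in factors:                              # apply the sparse product F_k ... F_1 x
    y = F @ y
print(f"dense: {n * n} params, butterfly: {2 * n * len(factors)} params")
```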
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Chesi, G | - |
dc.contributor.advisor | Wong, N | - |
dc.contributor.author | Lin, Rui | - |
dc.contributor.author | 林睿 | - |
dc.date.accessioned | 2022-11-18T10:41:30Z | - |
dc.date.available | 2022-11-18T10:41:30Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Lin, R. [林睿]. (2022). Novel compression techniques for compact deep neural network design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/322890 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Neural networks (Computer science) | - |
dc.title | Novel compression techniques for compact deep neural network design | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2022 | - |
dc.identifier.mmsid | 991044609106103414 | - |