postgraduate thesis: Novel compression techniques for compact deep neural network design
Title | Novel compression techniques for compact deep neural network design |
---|---|
Authors | Lin, Rui (林睿) |
Advisors | Chesi, G; Wong, N |
Issue Date | 2022 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Lin, R. [林睿]. (2022). Novel compression techniques for compact deep neural network design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Deep neural networks (DNNs) have achieved remarkable breakthroughs in various disciplines, such as classification and object detection. Although deeper structures and more trainable parameters have successfully boosted the performance of DNNs, they inevitably pose stringent challenges to deploying modern DNNs on edge devices with constrained hardware resources. This dilemma motivates research on DNN compression to obtain compact models that require little storage and achieve fast inference without sacrificing much accuracy. Existing compression approaches mainly fall into three categories: 1) low-rank decomposition, 2) pruning, and 3) quantization. This thesis explores these popular techniques and investigates another promising but under-explored direction, namely sparse linear transforms.
Low-rank decomposition methods treat fully connected and convolutional layers as tensors, aiming to replace them with low-rank factors (viz., a sequence of smaller layers). However, existing techniques in this category invariably adopt a 4-way view of the weight tensor, which impedes further compression. This thesis identifies this unexploited room and proposes a method that further tensorizes the input-channel axis into smaller modes. Smaller kernels and higher compression ratios can then be obtained by decomposing the newly generated higher-order tensor (a toy sketch of this tensorize-then-decompose step appears after the table below).
Pruning has two sub-classes: weight pruning and filter pruning. Whereas weight pruning removes small weights in the kernel tensor, filter pruning eliminates entire filters, leading to structured sparsity and generic speedup irrespective of the software/hardware. Notably, most existing pruning schemes operate in the spatial domain, and the frequency domain remains comparatively unexplored. This thesis therefore connects a previously mysterious rank-based metric in the spatial domain to a novel, analytical view in the frequency domain. Along this route, an efficient Fast Fourier Transform (FFT)-based energy-zone metric is proposed to evaluate filter importance from a spectral perspective (see the second sketch below the table).
Quantization approaches aim to retain high accuracy with low-precision weights/activations, thus reducing memory footprint and computation. Existing methods have developed complicated quantization strategies, e.g., mixed precision and adaptive quantization levels, to achieve this goal. However, they have potential problems: 1) pushing full-precision values directly to their quantized representations can be suboptimal, 2) quantizing weights independently loses their correlations, and 3) approximated gradients can be inaccurate. This thesis therefore proposes a novel pipeline that removes redundant information before quantization by considering weight correlations in the frequency domain. Moreover, the pipeline makes the gradients explicit, so even simple uniform quantizers achieve impressive results when plugged into it (the third sketch below illustrates the idea).
Compared with the above categories, sparse and structured matrix factorization constitutes a new yet under-explored compression strategy. The few existing works in this category all restrict the shapes of the weight matrices that can be factorized. Moreover, they aim to replace only one or a few of the largest layers, flattened in the GEMM setting, which may not yield significant compression. This thesis therefore introduces a new sparse linear transform that generalizes conventional butterfly matrices and can be adapted to variable input-output dimensions (the final sketch below shows the conventional structure being generalized). The new framework inherits the fine-to-coarse-grained learnable hierarchy of traditional butterflies, yielding more lightweight networks without compromising accuracy. |
Degree | Doctor of Philosophy |
Subject | Neural networks (Computer science) |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/322890 |
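The tensorize-then-decompose idea from the abstract can be illustrated in a few lines. The sketch below is a minimal numpy illustration, not the thesis's actual algorithm: it assumes a Tucker-style truncated higher-order SVD (HOSVD) as the decomposition, and the kernel shape, channel split, and ranks are made-up examples.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated HOSVD: one factor matrix per mode, plus a small core."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for mode, U in enumerate(factors):  # project each mode onto its subspace
        core = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# A toy conv kernel: 64 output channels, 32 input channels, 3x3 spatial.
W = np.random.randn(64, 32, 3, 3)

# Tensorize the input-channel axis (32 = 4 * 8) into two smaller modes,
# turning the 4-way kernel into a 5-way tensor before decomposing it.
W5 = W.reshape(64, 4, 8, 3, 3)
core, factors = hosvd(W5, ranks=(16, 3, 4, 3, 3))
compressed = core.size + sum(U.size for U in factors)
print(f"original: {W.size} parameters, factored: {compressed} parameters")
```

The extra modes shrink each factor matrix, which is why the higher-order view admits higher compression ratios than the plain 4-way decomposition.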
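The FFT-based filter scoring can likewise be sketched. The zone partition below (a small low-frequency corner versus everything else) and the 50% keep ratio are placeholder assumptions; the thesis's actual energy-zone definition is not reproduced here.

```python
import numpy as np

def energy_zone_scores(W, zone=2):
    """Score each filter by the share of its spectral energy outside a
    small low-frequency zone (a stand-in for an energy-zone metric)."""
    spec = np.fft.fft2(W)                      # 2-D FFT over each kernel's spatial axes
    energy = np.abs(spec) ** 2                 # (C_out, C_in, kh, kw)
    total = energy.sum(axis=(1, 2, 3))
    low = energy[:, :, :zone, :zone].sum(axis=(1, 2, 3))
    return (total - low) / (total + 1e-12)     # per-filter high-frequency share

W = np.random.randn(64, 32, 3, 3)              # (C_out, C_in, kh, kw)
scores = energy_zone_scores(W)
keep = np.sort(np.argsort(scores)[len(scores) // 2:])   # prune the bottom half
print("filters kept:", keep)
```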
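For the quantization pipeline, the sketch below captures only the high-level recipe stated in the abstract: transform to a frequency domain, drop redundant coefficients, then apply a plain uniform quantizer. The 1-D real FFT, magnitude thresholding, keep ratio, and 4-bit width are all assumptions standing in for the thesis's actual design.

```python
import numpy as np

def quantize_uniform(x, n_bits=4):
    """Plain symmetric uniform quantizer."""
    levels = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / levels + 1e-12
    return np.round(x / scale).clip(-levels, levels) * scale

def freq_prune_then_quantize(W, keep_ratio=0.5, n_bits=4):
    """Remove small frequency coefficients, then uniformly quantize the rest."""
    flat = W.reshape(W.shape[0], -1)
    spec = np.fft.rfft(flat, axis=1)           # per-filter 1-D spectrum
    mag = np.abs(spec)
    spec[mag < np.quantile(mag, 1 - keep_ratio)] = 0   # drop redundant coefficients
    spec = quantize_uniform(spec.real, n_bits) + 1j * quantize_uniform(spec.imag, n_bits)
    return np.fft.irfft(spec, n=flat.shape[1], axis=1).reshape(W.shape)

W = np.random.randn(64, 32, 3, 3)
W_hat = freq_prune_then_quantize(W)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```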
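Finally, the butterfly structure the thesis generalizes: a conventional n-by-n butterfly (n a power of two) factors a dense matrix into log2(n) block-sparse factors with two nonzeros per row, which is the parameter saving the abstract alludes to. The sketch constructs only this standard square version, not the thesis's variable-dimension generalization.

```python
import numpy as np

def butterfly_factors(n, rng=np.random.default_rng(0)):
    """log2(n) butterfly factors for an n x n transform (n a power of two).
    Each factor has two nonzeros per row, so the product stores
    2 * n * log2(n) parameters instead of n**2."""
    factors, stride = [], n // 2
    while stride >= 1:
        F = np.zeros((n, n))
        for i in range(n):
            j = i ^ stride                     # partner index at this level
            F[i, i], F[i, j] = rng.standard_normal(2)
        factors.append(F)
        stride //= 2
    return factors

n = 16
factors = butterfly_factors(n)
x = np.random.randn(n)
y = x
for F in factors:                              # apply the sparse product F_k ... F_1 x
    y = F @ y
print(f"dense: {n * n} params, butterfly: {2 * n * len(factors)} params")
```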
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Chesi, G | - |
dc.contributor.advisor | Wong, N | - |
dc.contributor.author | Lin, Rui | - |
dc.contributor.author | 林睿 | - |
dc.date.accessioned | 2022-11-18T10:41:30Z | - |
dc.date.available | 2022-11-18T10:41:30Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Lin, R. [林睿]. (2022). Novel compression techniques for compact deep neural network design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/322890 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Neural networks (Computer science) | - |
dc.title | Novel compression techniques for compact deep neural network design | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2022 | - |
dc.identifier.mmsid | 991044609106103414 | - |