
Postgraduate thesis: Novel compression techniques for compact deep neural network design

Title: Novel compression techniques for compact deep neural network design
Authors: Lin, Rui (林睿)
Advisors: Chesi, G; Wong, N
Issue Date: 2022
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Lin, R. [林睿]. (2022). Novel compression techniques for compact deep neural network design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Deep neural networks (DNNs) have achieved remarkable breakthroughs in various disciplines, such as classification and object detection. Although deeper structures and more trainable parameters have successfully boosted the performance of DNNs, they pose stringent challenges for deploying modern DNNs on edge devices with constrained hardware resources. This dilemma motivates research on DNN compression, which seeks compact models that require little storage and achieve fast inference without sacrificing much accuracy. Existing compression approaches mainly fall into three categories: 1) low-rank decomposition, 2) pruning, and 3) quantization. This thesis explores these popular techniques and also investigates another promising but under-explored direction, namely, sparse linear transforms.

Low-rank decomposition methods treat fully connected and convolutional layers as tensors and aim to replace them with low-rank factors (viz., a sequence of smaller layers). However, existing techniques in this category invariably adopt a 4-way view of the weight tensor, which impedes further compression. This thesis recognizes this unexploited room and proposes a method that further tensorizes the input-channel axis into smaller modes. Smaller kernels and higher compression ratios can then be obtained by decomposing the newly generated higher-order tensor.

Pruning has two sub-classes: weight pruning and filter pruning. Whereas weight pruning removes small weights in the kernel tensor, filter pruning eliminates entire filters, leading to structured sparsity and generic speedup irrespective of the software/hardware. Notably, most existing pruning schemes operate in the spatial domain, and the frequency domain remains relatively unexplored. This thesis therefore connects a previously mysterious rank-based metric in the spatial domain to a novel, analytical view in the frequency domain. Along this route, an efficient Fast Fourier Transform (FFT)-based energy-zone metric is proposed to evaluate filter importance from an innovative spectral perspective.

Quantization approaches utilize low-precision weights/activations while pursuing high accuracy, thus reducing memory footprint and computation. Existing methods have developed complicated quantization strategies, e.g., mixed precision and adaptive quantization levels, to achieve this goal. However, they have potential problems: 1) pushing full-precision values directly to their quantized representations can be suboptimal, 2) quantizing weights independently discards their correlations, and 3) approximated gradients can be inaccurate. This thesis therefore proposes a novel pipeline that removes redundant information before quantization by considering weight correlations in the frequency domain. Moreover, the pipeline admits explicit gradients, so even simple uniform quantizers achieve impressive results when plugged into it.

Compared with the above categories, sparse and structured matrix factorization constitutes a new yet under-explored compression strategy. The few existing works in this category all restrict the shapes of the weight matrices that can be factorized. Moreover, they aim to replace only one or a few of the largest layers, flattened into the GEMM setting, which may not yield significant compression. This thesis therefore introduces a brand-new sparse linear transform that generalizes conventional butterfly matrices and adapts to variable input-output dimensions. The new framework inherits the fine-to-coarse-grained learnable hierarchy of traditional butterflies, yielding more lightweight networks without compromising accuracy.
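To make the tensorization idea concrete, here is a minimal sketch (an illustration, not the thesis code) of splitting a conv kernel's input-channel axis into two smaller modes and truncating an SVD of a matricization that groups the new mode with the output channels. The shapes, the mode split c1 x c2, and the target rank R are all assumptions; the thesis's actual method would apply a full higher-order decomposition (e.g., tensor-train or Tucker) over the new modes and fold the factors back into a sequence of small convolutions.

```python
import numpy as np

# Sketch: tensorize a 4-way conv kernel into a 5-way tensor by splitting
# C_in = c1 * c2, then truncate an SVD of a cross-mode matricization.
C_out, C_in, k = 64, 32, 3          # illustrative kernel shape
W = np.random.randn(C_out, C_in, k, k)

c1, c2 = 8, 4                        # assumed split of the input-channel axis
W5 = W.reshape(C_out, c1, c2, k, k)  # 4-way kernel -> 5-way tensor

# Matricize by grouping (C_out, c1) vs. (c2, k, k), then truncate to rank R.
M = W5.reshape(C_out * c1, c2 * k * k)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
R = 8                                # illustrative target rank
M_low = (U[:, :R] * s[:R]) @ Vt[:R, :]

orig_params = W.size
low_params = U[:, :R].size + Vt[:R, :].size
print(f"params {orig_params} -> {low_params} "
      f"({orig_params / low_params:.1f}x compression), "
      f"rel. err {np.linalg.norm(M - M_low) / np.linalg.norm(M):.3f}")
```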
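The FFT-based energy-zone metric can likewise be pictured with a small sketch: score each filter by how much of its feature map's spectral energy falls outside a central low-frequency zone. The zone size, the scoring rule, and the prune-lowest-first ordering below are illustrative assumptions, not the thesis's exact metric.

```python
import numpy as np

def energy_zone_score(fmap, zone=4):
    """Fraction of spectral energy outside a central low-frequency zone.

    fmap: (H, W) single-channel feature map; zone is an assumed half-width.
    """
    F = np.fft.fftshift(np.fft.fft2(fmap))   # center DC component
    E = np.abs(F) ** 2
    cy, cx = E.shape[0] // 2, E.shape[1] // 2
    low = E[cy - zone:cy + zone, cx - zone:cx + zone].sum()
    return (E.sum() - low) / E.sum()

# Rank filters by their scores over one batch of feature maps (illustrative).
fmaps = np.random.randn(8, 16, 16)           # outputs of 8 filters
scores = np.array([energy_zone_score(f) for f in fmaps])
prune_order = np.argsort(scores)             # lowest-scoring filters go first
print(scores.round(3), prune_order)
```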
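The quantization pipeline's core idea, stripping frequency-domain redundancy before handing the weights to a simple uniform quantizer, can be sketched as follows. The orthonormal DCT, the coefficient threshold, and the 4-bit width are stand-ins assumed for illustration, not the transform or hyperparameters the thesis actually uses.

```python
import numpy as np
from scipy.fft import dct, idct

def uniform_quantize(x, bits=4):
    """Simple symmetric uniform quantizer (illustrative)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

W = np.random.randn(64, 128)                 # illustrative weight matrix

# Decorrelate rows in the frequency domain, drop low-energy coefficients,
# quantize what remains, and transform back.
C = dct(W, norm="ortho", axis=1)
C[np.abs(C) < 0.1] = 0.0                     # assumed redundancy threshold
W_hat = idct(uniform_quantize(C), norm="ortho", axis=1)

print("rel. error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```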
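Finally, the butterfly structure being generalized: a classical square, power-of-two butterfly factorizes an N x N transform into log2(N) sparse factors with two nonzeros per row, so a matrix-vector product costs O(N log N) with only 2N log2(N) parameters. The sketch below shows only this classical baseline; the thesis's contribution is extending it to variable input-output dimensions.

```python
import numpy as np

def random_butterfly_factor(N, stride):
    """One butterfly stage: each row i mixes entries i and i ^ stride."""
    B = np.zeros((N, N))
    for i in range(N):
        j = i ^ stride                      # butterfly partner index
        B[i, i], B[i, j] = np.random.randn(), np.random.randn()
    return B

def butterfly_apply(factors, x):
    for B in factors:                        # each factor is 2-sparse per row
        x = B @ x
    return x

N = 8
factors = [random_butterfly_factor(N, 1 << s) for s in range(int(np.log2(N)))]
y = butterfly_apply(factors, np.random.randn(N))

n_params = sum(int((B != 0).sum()) for B in factors)
print(y.round(3), f"params: {n_params} vs dense {N * N}")
```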
Degree: Doctor of Philosophy
Subject: Neural networks (Computer science)
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/322890

 

DC Field | Value | Language
dc.contributor.advisor | Chesi, G | -
dc.contributor.advisor | Wong, N | -
dc.contributor.author | Lin, Rui | -
dc.contributor.author | 林睿 | -
dc.date.accessioned | 2022-11-18T10:41:30Z | -
dc.date.available | 2022-11-18T10:41:30Z | -
dc.date.issued | 2022 | -
dc.identifier.citation | Lin, R. [林睿]. (2022). Novel compression techniques for compact deep neural network design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/322890 | -
dc.description.abstract | (abstract text identical to the Abstract field above) | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Neural networks (Computer science) | -
dc.title | Novel compression techniques for compact deep neural network design | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Electrical and Electronic Engineering | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2022 | -
dc.identifier.mmsid | 991044609106103414 | -
