Domain-specific FPGA overlay : an architecture-compilation co-design methodology

Shi, Runbin; 石潤彬

Appears in Collections:
- HKU Theses Online
- Electrical & Electronic Engineering: Theses

postgraduate thesis: Domain-specific FPGA overlay : an architecture-compilation co-design methodology

Title	Domain-specific FPGA overlay : an architecture-compilation co-design methodology
Authors	Shi, Runbin 石潤彬
Advisors	So, HKH; Lam, EYM
Issue Date	2020
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Shi, R. [石潤彬]. (2020). Domain-specific FPGA overlay : an architecture-compilation co-design methodology. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	From smartwatches at the edge to data centers in the cloud, computation takes place everywhere. Yet while the end of Moore’s Law is in sight, the demand for computing has grown explosively in recent years. Existing systems based on distributed processing or on general-purpose graphics processing units (GPGPUs) for acceleration have demonstrated great success in a few domains, such as deep learning and computational protein design. Nevertheless, these general platforms suffer from low power efficiency and resource utilization in many cases because of the conflict between generalization and specialization. Customized computing with field-programmable gate arrays (FPGAs) is a promising direction for future parallel computing that benefits both efficiency and hardware cost. However, the FPGA (hardware) design flow is fundamentally different from that of software development. Compared with general-purpose processors, FPGAs have far finer-grained programmable units and a much larger scale of parallelism, which introduce great difficulties for end users. Over decades of research into FPGA design methods, the overlay has emerged as a very promising form that greatly narrows the gap between workloads and the physical architecture. An overlay provides a virtual architecture that adapts to a group of applications, together with a compilation tool that translates the workload into soft control instructions or into hardware with workload-specific configurations. This thesis follows the overlay line of research and focuses on the challenges in domain-specific FPGA overlay (DSFO) design. We first address the memory design challenge in DSFO and present a line buffer for transforming high-throughput streaming data into 2D stencil patterns. In particular, fast context switching is supported so that the buffer can organize arbitrarily sized images seamlessly. This design is shown to adapt to general image processing applications in a streaming manner.
To deliver a high-performance FPGA accelerator for deep learning (DL) inference, in the second part of the thesis we demonstrate a DL overlay that mainly addresses the design challenge of architecture-FPGA layout mismatch. By taking the FPGA layout into consideration, the overlay hardware achieves a near-theoretical operating frequency (650 MHz). Meanwhile, the compilation strategy achieves over 80% hardware efficiency (utilization) on different DL layers. To address the challenge of irregular computation with sparse matrices, in the third part of the thesis we propose a DSFO design for the motivating domain of time series analysis. This work also covers algorithm optimization that increases hardware efficiency. Specifically, we propose a structured sparsity pattern (CSB) for model pruning that trades off flexibility against hardware cost. We then present the overlay for CSB-based matrix computation, which addresses the workload imbalance issue with both architecture and compilation support. By leveraging the overlay design method, we propose designs for three particular domains with orthogonal challenges. These designs demonstrate significant improvements in performance, flexibility, and hardware and power efficiency. The proposed techniques strengthen the FPGA overlay design method and move it into a mature state. Importantly, these advantages further demonstrate the generality and effectiveness of the overlay method. We believe the proposed overlays will support and inspire future custom computing in more domains.
Degree	Doctor of Philosophy
Subject	Field programmable gate arrays
Dept/Program	Electrical and Electronic Engineering
Persistent Identifier	http://hdl.handle.net/10722/295614

DC Field	Value	Language
dc.contributor.advisor	So, HKH	-
dc.contributor.advisor	Lam, EYM	-
dc.contributor.author	Shi, Runbin	-
dc.contributor.author	石潤彬	-
dc.date.accessioned	2021-02-02T03:05:16Z	-
dc.date.available	2021-02-02T03:05:16Z	-
dc.date.issued	2020	-
dc.identifier.citation	Shi, R. [石潤彬]. (2020). Domain-specific FPGA overlay : an architecture-compilation co-design methodology. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/295614	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Field programmable gate arrays	-
dc.title	Domain-specific FPGA overlay : an architecture-compilation co-design methodology	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Electrical and Electronic Engineering	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2021	-
dc.identifier.mmsid	991044340098703414	-
