postgraduate thesis: Domain-specific FPGA overlay : an architecture-compilation co-design methodology
Title | Domain-specific FPGA overlay : an architecture-compilation co-design methodology |
---|---|
Authors | Shi, Runbin (石潤彬) |
Advisors | So, HKH; Lam, EYM |
Issue Date | 2020 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Shi, R. [石潤彬]. (2020). Domain-specific FPGA overlay : an architecture-compilation co-design methodology. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | From smartwatches at the edge to data centers in the cloud, computation takes place everywhere. With the end of Moore’s Law in sight, however, the demand for computing has grown explosively in recent years. Existing systems based on distributed processing or on general-purpose graphics processing units (GPGPUs) for acceleration have demonstrated great success in a few domains such as deep learning and computational protein design. Nevertheless, these general platforms suffer from low power efficiency and resource utilization in many cases due to the conflict between generalization and specialization. Customized computing with field-programmable gate arrays (FPGAs) is a promising direction for future parallel computing that benefits both efficiency and hardware cost.
However, the FPGA (hardware) design flow is fundamentally different from that of software development. Compared with general-purpose processors, FPGAs have far finer-grained programmable units and a much larger scale of parallelism, which make them considerably harder for end users to program. Over decades of research into FPGA design methods, the overlay has emerged as a very promising approach that greatly narrows the gap between workloads and the physical architecture. An overlay provides a virtual architecture that adapts to a group of applications, together with a compilation tool that translates the workload into soft control instructions or into hardware with workload-specific configurations.
This thesis follows the overlay line of research and focuses on the challenges in domain-specific FPGA overlay (DSFO) design. We first address the memory design challenge in DSFO and present a line buffer that transforms high-throughput streaming data into 2D stencil patterns. In particular, fast context switching is supported so that the buffer can handle arbitrarily sized images seamlessly. This design is shown to adapt to general image processing applications in a streaming manner.
To deliver a high-performance FPGA accelerator for deep learning (DL) inference, we demonstrate a DL overlay in the second part of the thesis that mainly addresses the overlay design challenge of architecture-FPGA layout mismatch. By taking the FPGA layout into account, the overlay hardware achieves a near-theoretical operating frequency (650 MHz). Meanwhile, the compilation strategy realizes over 80% hardware efficiency (utilization) on different DL layers.
To address the challenge of irregular computation with sparse matrices, in the third part of the thesis we propose a DSFO design for the motivating domain of time series analysis. This work also covers algorithm optimization that increases hardware efficiency. Specifically, we propose a structured sparsity pattern (CSB) for model pruning that trades off flexibility against hardware cost. We then present the overlay for CSB-based matrix computation, which addresses the workload imbalance issue with both architecture and compilation support.
By leveraging the overlay design method, we propose designs for three particular domains with orthogonal challenges. These designs demonstrate significant improvements in performance, flexibility, hardware efficiency, and power efficiency. The proposed techniques strengthen the FPGA overlay design method and move it toward a mature state. Importantly, these advantages further demonstrate the generality and effectiveness of the overlay method. We believe the proposed overlays will support and inspire future custom computing in more domains. |
Degree | Doctor of Philosophy |
Subject | Field programmable gate arrays |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/295614 |
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | So, HKH | - |
dc.contributor.advisor | Lam, EYM | - |
dc.contributor.author | Shi, Runbin | - |
dc.contributor.author | 石潤彬 | - |
dc.date.accessioned | 2021-02-02T03:05:16Z | - |
dc.date.available | 2021-02-02T03:05:16Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Shi, R. [石潤彬]. (2020). Domain-specific FPGA overlay : an architecture-compilation co-design methodology. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/295614 | - |
dc.description.abstract | From smartwatches at the edge to data centers in the cloud, computation takes place everywhere. With the end of Moore’s Law in sight, however, the demand for computing has grown explosively in recent years. Existing systems based on distributed processing or on general-purpose graphics processing units (GPGPUs) for acceleration have demonstrated great success in a few domains such as deep learning and computational protein design. Nevertheless, these general platforms suffer from low power efficiency and resource utilization in many cases due to the conflict between generalization and specialization. Customized computing with field-programmable gate arrays (FPGAs) is a promising direction for future parallel computing that benefits both efficiency and hardware cost. However, the FPGA (hardware) design flow is fundamentally different from that of software development. Compared with general-purpose processors, FPGAs have far finer-grained programmable units and a much larger scale of parallelism, which make them considerably harder for end users to program. Over decades of research into FPGA design methods, the overlay has emerged as a very promising approach that greatly narrows the gap between workloads and the physical architecture. An overlay provides a virtual architecture that adapts to a group of applications, together with a compilation tool that translates the workload into soft control instructions or into hardware with workload-specific configurations. This thesis follows the overlay line of research and focuses on the challenges in domain-specific FPGA overlay (DSFO) design. We first address the memory design challenge in DSFO and present a line buffer that transforms high-throughput streaming data into 2D stencil patterns. In particular, fast context switching is supported so that the buffer can handle arbitrarily sized images seamlessly. This design is shown to adapt to general image processing applications in a streaming manner. To deliver a high-performance FPGA accelerator for deep learning (DL) inference, we demonstrate a DL overlay in the second part of the thesis that mainly addresses the overlay design challenge of architecture-FPGA layout mismatch. By taking the FPGA layout into account, the overlay hardware achieves a near-theoretical operating frequency (650 MHz). Meanwhile, the compilation strategy realizes over 80% hardware efficiency (utilization) on different DL layers. To address the challenge of irregular computation with sparse matrices, in the third part of the thesis we propose a DSFO design for the motivating domain of time series analysis. This work also covers algorithm optimization that increases hardware efficiency. Specifically, we propose a structured sparsity pattern (CSB) for model pruning that trades off flexibility against hardware cost. We then present the overlay for CSB-based matrix computation, which addresses the workload imbalance issue with both architecture and compilation support. By leveraging the overlay design method, we propose designs for three particular domains with orthogonal challenges. These designs demonstrate significant improvements in performance, flexibility, hardware efficiency, and power efficiency. The proposed techniques strengthen the FPGA overlay design method and move it toward a mature state. Importantly, these advantages further demonstrate the generality and effectiveness of the overlay method. We believe the proposed overlays will support and inspire future custom computing in more domains. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Field programmable gate arrays | - |
dc.title | Domain-specific FPGA overlay : an architecture-compilation co-design methodology | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2021 | - |
dc.identifier.mmsid | 991044340098703414 | - |