postgraduate thesis: Domain-specific FPGA overlay : an architecture-compilation co-design methodology

Title: Domain-specific FPGA overlay : an architecture-compilation co-design methodology
Authors: Shi, Runbin [石潤彬]
Advisor(s): So, HKH; Lam, EYM
Issue Date: 2020
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation
Shi, R. [石潤彬]. (2020). Domain-specific FPGA overlay : an architecture-compilation co-design methodology. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract
From smartwatches at the edge to data centers in the cloud, computation takes place everywhere. With the end of Moore’s Law in sight, however, the demand for computing has grown explosively in recent years. Existing systems based on distributed processing or on general-purpose graphics processing units (GPGPUs) for acceleration have demonstrated great success in a few domains, such as deep learning and computational protein design. Nevertheless, these general platforms suffer from low power efficiency and resource utilization in many cases because of the conflict between generalization and specialization. Customized computing with field-programmable gate arrays (FPGAs) is a promising direction for future parallel computing that benefits both efficiency and hardware cost. However, the FPGA (hardware) design flow is fundamentally different from that of software development. Compared with general-purpose processors, FPGAs have much finer-grained programmable units and a much larger scale of parallelism, which introduce great difficulties for end users. Over decades of investigation into FPGA design methods, the overlay has emerged as a very promising form that greatly narrows the gap between workloads and the physical architecture. An overlay provides a virtual architecture that adapts to a group of applications, together with a compilation tool that translates the workload into soft control instructions or into hardware with workload-specific configurations. This thesis follows the overlay line of research and focuses on the challenges in domain-specific FPGA overlay (DSFO) design. We first address the memory design challenge in DSFO and present a line buffer that transforms high-throughput streaming data into 2D stencil patterns. In particular, fast context switching is supported so that the buffer can organize arbitrarily sized images seamlessly. This design is shown to be adaptable to general image processing applications in a streaming manner.
To deliver a high-performance FPGA accelerator for deep learning (DL) inference, the second part of the thesis demonstrates a DL overlay that mainly addresses the design challenge of the mismatch between the architecture and the FPGA layout. By taking the FPGA layout into consideration, the overlay hardware achieves a near-theoretical operating frequency (650 MHz). Meanwhile, the compilation strategy realizes over 80% hardware efficiency (utilization) on different DL layers. To address the challenge of irregular computation with sparse matrices, the third part of the thesis proposes a DSFO design for the motivating domain of time series analysis. This work also covers algorithm optimization that increases hardware efficiency. Specifically, we propose a structured sparsity pattern (CSB) for model pruning that trades off flexibility against hardware cost. We then present the overlay for CSB-based matrix computation, which addresses the workload imbalance issue with both architecture and compilation support. By leveraging the overlay design method, we propose designs for three particular domains with orthogonal challenges. These designs demonstrate significant improvements in performance, flexibility, hardware efficiency, and power efficiency. The proposed techniques strengthen the FPGA overlay design method and move it into a mature state. Importantly, these advantages further demonstrate the generality and effectiveness of the overlay method. We believe the proposed overlays will support and inspire future custom computing in more domains.
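The line-buffer idea mentioned in the abstract (turning a row-major pixel stream into 2D stencil windows) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the thesis's hardware design: the function name `stencil_windows`, its parameters, and the zero-initialised buffers are hypothetical, and the fast context switching described in the abstract is not modelled.

```python
from collections import deque


def stencil_windows(pixels, width, k=3):
    """Yield (row, col, window) for every complete k x k window.

    `pixels` is a row-major stream for an image `width` pixels wide;
    (row, col) is the position of the window's bottom-right pixel.
    Hypothetical software model, not the thesis's hardware line buffer.
    """
    # k-1 line buffers: lines[0] holds the oldest buffered row.
    lines = [[0] * width for _ in range(k - 1)]
    # One k-wide shift register per window row.
    window = [deque([0] * k, maxlen=k) for _ in range(k)]

    for i, px in enumerate(pixels):
        row, col = divmod(i, width)

        # Vertical slice at this column: buffered rows, then the new pixel.
        column = [lines[r][col] for r in range(k - 1)] + [px]

        # Age the line buffers at this column and store the new pixel.
        if k > 1:
            for r in range(k - 2):
                lines[r][col] = lines[r + 1][col]
            lines[k - 2][col] = px

        # Shift the new column into the window registers.
        for r in range(k):
            window[r].append(column[r])

        # A window is valid once k rows and k columns have been seen.
        if row >= k - 1 and col >= k - 1:
            yield row, col, [list(w) for w in window]
```

For example, `list(stencil_windows(range(16), width=4, k=3))` walks a 4 x 4 image and produces the four 3 x 3 windows it contains, each delivered as soon as its bottom-right pixel arrives in the stream.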
Degree: Doctor of Philosophy
Subject: Field programmable gate arrays
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/295614

 

| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.advisor | So, HKH | - |
| dc.contributor.advisor | Lam, EYM | - |
| dc.contributor.author | Shi, Runbin | - |
| dc.contributor.author | 石潤彬 | - |
| dc.date.accessioned | 2021-02-02T03:05:16Z | - |
| dc.date.available | 2021-02-02T03:05:16Z | - |
| dc.date.issued | 2020 | - |
| dc.identifier.citation | Shi, R. [石潤彬]. (2020). Domain-specific FPGA overlay : an architecture-compilation co-design methodology. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/295614 | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Field programmable gate arrays | - |
| dc.title | Domain-specific FPGA overlay : an architecture-compilation co-design methodology | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2021 | - |
| dc.identifier.mmsid | 991044340098703414 | - |
