Postgraduate thesis: Accelerating machine learning applications with FPGAs

Title: Accelerating machine learning applications with FPGAs
Authors: Ho, Man-ho (何文灝)
Advisors: So, HKH; Lam, EYM
Issue Date: 2018
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Ho, M. [何文灝]. (2018). Accelerating machine learning applications with FPGAs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: As demand rises for more power-efficient devices to handle machine learning tasks, FPGAs offer a platform with great potential, thanks to their flexibility to adapt to novel algorithms and designs. However, FPGA users face many unique difficulties when implementing such algorithms: the lack of algorithmic libraries, of custom-precision math operator libraries, and of reusable common interfaces for host-software integration.

In this thesis, we start by presenting a case study on accelerating Support Vector Machine (SVM) training on an Apache Spark cluster equipped with FPGAs, demonstrating a consistent speed-up of about 1.6x over CPUs as the cluster size increases to 8 nodes.

Then, to address the lack of math operators, we propose NnCore, an open-source function generator for floating-point non-linear operator cores built from fixed-point piecewise polynomial segments. The proposed work exploits properties such as oddness/evenness and intercept-at-origin, often found in numerical functions commonly used in machine learning applications, and applies an improved segmentation algorithm that specifically handles "outlier" segments to reduce memory size. Experiments show that at single precision, generated cores use up to 65% fewer BRAMs and run at up to 2.2x the clock speed of cores produced by a previous generic function generator. At half precision, the cores can run at 1.2x the clock speed while requiring more resources, or use a comparable amount of resources while running at 12% to 45% lower clock speed.

Finally, we present hDNN, a software-hardware integrated research platform for deep learning that can serve both as building blocks for further research and as a baseline comparison target for benchmarks. It consists of a collection of hardware IP modules written in HLS C++ for the Xilinx SDAccel platform, and a modified Caffe that supports quantized arithmetic in individual layers with a user-defined quantization scheme and can target CPUs, GPUs, or FPGAs. The hardware designs include a 32x32 systolic array that runs at 200 MHz for 16-bit integer/fixed-point types on a Virtex-7 FPGA, which translates to a theoretical throughput of 409.6 GOPS. Together with the provided "Im2col" module, convolution operations can also be performed. Experiments show speed-ups of 1.8x to 32.7x over an optimized CPU implementation for the matrix-multiplication part of convolution layers in networks such as LeNet, Cifar10, and CaffeNet.
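The symmetry idea behind NnCore can be illustrated with a small sketch. The following is a minimal, hypothetical piecewise approximation of tanh, not NnCore's actual generated code: the segment count, the degree-1 fit, and the table layout are invented for illustration. Storing segments only for x >= 0 and mirroring via oddness halves the coefficient table, the kind of memory saving the abstract attributes to these function properties.

    #include <array>
    #include <cmath>
    #include <cstdio>

    // Sketch only: piecewise-linear tanh stored for x >= 0.
    // Oddness (tanh(-x) = -tanh(x)) halves the table; intercept-at-origin
    // means the first segment's constant term is zero by construction.
    constexpr int   kSegments = 32;     // hypothetical segment count
    constexpr float kRangeMax = 4.0f;   // tanh(x) is ~1 beyond this point

    struct Segment { float c0, c1; };   // y = c1*x + c0 on the segment

    std::array<Segment, kSegments> build_table() {
        std::array<Segment, kSegments> t{};
        const float w = kRangeMax / kSegments;
        for (int i = 0; i < kSegments; ++i) {
            const float x0 = i * w;
            const float y0 = std::tanh(x0), y1 = std::tanh(x0 + w);
            t[i].c1 = (y1 - y0) / w;      // slope from segment endpoints
            t[i].c0 = y0 - t[i].c1 * x0;  // intercept (0 at the origin)
        }
        return t;
    }

    float tanh_pw(float x, const std::array<Segment, kSegments>& t) {
        const float ax = std::fabs(x);   // fold the negative half onto [0, inf)
        if (ax >= kRangeMax) return std::copysign(1.0f, x);
        const int i = static_cast<int>(ax / kRangeMax * kSegments);
        const float y = t[i].c1 * ax + t[i].c0;
        return std::copysign(y, x);      // restore the sign via oddness
    }

    int main() {
        const auto t = build_table();
        for (float x : {-3.0f, -0.5f, 0.0f, 0.5f, 3.0f})
            std::printf("x=%5.2f approx=%8.5f exact=%8.5f\n",
                        x, tanh_pw(x, t), std::tanh(x));
    }

The actual generator works on fixed-point polynomial segments and handles "outlier" segments specially; this float sketch only shows why oddness and intercept-at-origin shrink the stored table.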
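The record does not describe hDNN's layer interface, but a rough sketch of what "quantized arithmetic with a user-defined quantization scheme" can look like for the 16-bit fixed-point types mentioned is a Q-format conversion pair, with the fraction width as the user-chosen parameter (all names here are hypothetical):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Hypothetical Q-format helpers: frac_bits is the user-chosen
    // quantization parameter (e.g. Q4.12 -> frac_bits = 12).
    int16_t quantize(float x, int frac_bits) {
        const float scaled = std::round(x * std::ldexp(1.0f, frac_bits));  // x * 2^frac
        return static_cast<int16_t>(std::clamp(scaled, -32768.0f, 32767.0f));
    }

    float dequantize(int16_t q, int frac_bits) {
        return std::ldexp(static_cast<float>(q), -frac_bits);              // q / 2^frac
    }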
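The 409.6 GOPS figure follows directly from the array dimensions and clock rate, counting each multiply-accumulate as two operations:

    32 x 32 PEs x 2 ops/PE/cycle x 200e6 cycles/s = 409.6e9 ops/s = 409.6 GOPS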
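A minimal im2col sketch (hypothetical layout; the thesis's HLS module is not shown in this record) illustrates how such a module lets the same systolic array cover convolutions: each KxK patch of the input becomes one column, so convolution reduces to a matrix multiplication.

    #include <vector>

    // Sketch: unroll each KxK patch of an HxW single-channel image into one
    // column of a (K*K) x (outH*outW) matrix (row-major, no padding/stride).
    std::vector<float> im2col(const std::vector<float>& img, int H, int W, int K) {
        const int outH = H - K + 1, outW = W - K + 1;
        std::vector<float> cols(static_cast<size_t>(K) * K * outH * outW);
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
                for (int oy = 0; oy < outH; ++oy)
                    for (int ox = 0; ox < outW; ++ox)
                        cols[((ky * K + kx) * outH + oy) * outW + ox] =
                            img[(oy + ky) * W + (ox + kx)];
        return cols;
    }

Multiplying a (numFilters x K*K) weight matrix by this patch matrix yields every convolution output in one GEMM, which maps directly onto the systolic array.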
Degree: Doctor of Philosophy
Subjects: Machine learning; Field programmable gate arrays
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/263209

 

DC Field / Value
dc.contributor.advisor: So, HKH
dc.contributor.advisor: Lam, EYM
dc.contributor.author: Ho, Man-ho
dc.contributor.author: 何文灝
dc.date.accessioned: 2018-10-16T07:35:01Z
dc.date.available: 2018-10-16T07:35:01Z
dc.date.issued: 2018
dc.identifier.citation: Ho, M. [何文灝]. (2018). Accelerating machine learning applications with FPGAs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/263209
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Field programmable gate arrays
dc.title: Accelerating machine learning applications with FPGAs
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Electrical and Electronic Engineering
dc.description.nature: published_or_final_version
dc.identifier.doi: 10.5353/th_991044046591103414
dc.date.hkucongregation: 2018
dc.identifier.mmsid: 991044046591103414
