Postgraduate thesis: Accelerating machine learning applications with FPGAs

Title: Accelerating machine learning applications with FPGAs
Authors: Ho, Man-ho (何文灝)
Advisors: So, HKH; Lam, EYM
Issue Date: 2018
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Ho, M. [何文灝]. (2018). Accelerating machine learning applications with FPGAs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: As demand rises for more power-efficient devices to handle machine learning tasks, FPGAs offer a platform with great potential, thanks to their flexibility to adapt to novel algorithms and designs. However, FPGA users face many unique difficulties when implementing such algorithms: the lack of algorithmic libraries, of custom-precision math operator libraries, and of reusable common interfaces for host-software integration.

In this thesis, we start by presenting a case study on accelerating Support Vector Machine (SVM) training on an Apache Spark cluster equipped with FPGAs, demonstrating a consistent speed-up of about 1.6x over CPUs as the cluster size increases to 8 nodes.

Then, to address the lack of math operators, we propose NnCore, an open-source function generator for floating-point non-linear operator cores built from fixed-point piecewise polynomial segments. The proposed work exploits properties such as oddness/evenness and intercept-at-origin, often found in numerical functions commonly used in machine learning applications, and applies an improved segmentation algorithm that specifically handles "outlier" segments to reduce memory size. Experiments show that at single precision, generated cores use up to 65% fewer BRAMs and run at up to 2.2x the clock speed of cores produced by a previous generic function generator. At half precision, the cores can run at 1.2x the clock speed while requiring more resources, or use a comparable amount of resources while running at 12% to 45% lower clock speed.

Finally, we present hDNN, a software-hardware integrated research platform for deep learning that can serve both as building blocks for further research and as a baseline comparison target for benchmarks. It consists of a collection of hardware IP modules written in HLS C++ for the Xilinx SDAccel platform, and a modified Caffe that supports quantized arithmetic in individual layers with a user-defined quantization scheme and can target CPUs, GPUs, or FPGAs. The hardware designs include a 32x32 systolic array that runs at 200 MHz for 16-bit integer/fixed-point types on a Virtex-7 FPGA, which translates to a theoretical throughput of 409.6 GOPS. Together with the provided "Im2col" module, convolution operations can also be performed. Experiments show speed-ups of 1.8x to 32.7x over an optimized CPU implementation for the matrix-multiplication part of convolution layers in networks such as LeNet, Cifar10, and CaffeNet.
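The symmetry idea behind NnCore can be illustrated with a small sketch. The following is a minimal, hypothetical piecewise approximation of tanh, not NnCore's actual generated code: the segment count, the degree-1 fit, and the table layout are invented for illustration. Storing segments only for x >= 0 and mirroring via oddness halves the coefficient table, the kind of memory saving the abstract attributes to these function properties.

    #include <array>
    #include <cmath>
    #include <cstdio>

    // Sketch only: piecewise-linear tanh stored for x >= 0.
    // Oddness (tanh(-x) = -tanh(x)) halves the table; intercept-at-origin
    // means the first segment's constant term is zero by construction.
    constexpr int   kSegments = 32;     // hypothetical segment count
    constexpr float kRangeMax = 4.0f;   // tanh(x) is ~1 beyond this point

    struct Segment { float c0, c1; };   // y = c1*x + c0 on the segment

    std::array<Segment, kSegments> build_table() {
        std::array<Segment, kSegments> t{};
        const float w = kRangeMax / kSegments;
        for (int i = 0; i < kSegments; ++i) {
            const float x0 = i * w;
            const float y0 = std::tanh(x0), y1 = std::tanh(x0 + w);
            t[i].c1 = (y1 - y0) / w;      // slope from segment endpoints
            t[i].c0 = y0 - t[i].c1 * x0;  // intercept (0 at the origin)
        }
        return t;
    }

    float tanh_pw(float x, const std::array<Segment, kSegments>& t) {
        const float ax = std::fabs(x);   // fold the negative half onto [0, inf)
        if (ax >= kRangeMax) return std::copysign(1.0f, x);
        const int i = static_cast<int>(ax / kRangeMax * kSegments);
        const float y = t[i].c1 * ax + t[i].c0;
        return std::copysign(y, x);      // restore the sign via oddness
    }

    int main() {
        const auto t = build_table();
        for (float x : {-3.0f, -0.5f, 0.0f, 0.5f, 3.0f})
            std::printf("x=%5.2f approx=%8.5f exact=%8.5f\n",
                        x, tanh_pw(x, t), std::tanh(x));
    }

The actual generator works on fixed-point polynomial segments and handles "outlier" segments specially; this float sketch only shows why oddness and intercept-at-origin shrink the stored table.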
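The record does not describe hDNN's layer interface, but a rough sketch of what "quantized arithmetic with a user-defined quantization scheme" can look like for the 16-bit fixed-point types mentioned is a Q-format conversion pair, with the fraction width as the user-chosen parameter (all names here are hypothetical):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Hypothetical Q-format helpers: frac_bits is the user-chosen
    // quantization parameter (e.g. Q4.12 -> frac_bits = 12).
    int16_t quantize(float x, int frac_bits) {
        const float scaled = std::round(x * std::ldexp(1.0f, frac_bits));  // x * 2^frac
        return static_cast<int16_t>(std::clamp(scaled, -32768.0f, 32767.0f));
    }

    float dequantize(int16_t q, int frac_bits) {
        return std::ldexp(static_cast<float>(q), -frac_bits);              // q / 2^frac
    }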
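The 409.6 GOPS figure follows directly from the array dimensions and clock rate, counting each multiply-accumulate as two operations:

    32 x 32 PEs x 2 ops/PE/cycle x 200e6 cycles/s = 409.6e9 ops/s = 409.6 GOPS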
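A minimal im2col sketch (hypothetical layout; the thesis's HLS module is not shown in this record) illustrates how such a module lets the same systolic array cover convolutions: each KxK patch of the input becomes one column, so convolution reduces to a matrix multiplication.

    #include <vector>

    // Sketch: unroll each KxK patch of an HxW single-channel image into one
    // column of a (K*K) x (outH*outW) matrix (row-major, no padding/stride).
    std::vector<float> im2col(const std::vector<float>& img, int H, int W, int K) {
        const int outH = H - K + 1, outW = W - K + 1;
        std::vector<float> cols(static_cast<size_t>(K) * K * outH * outW);
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
                for (int oy = 0; oy < outH; ++oy)
                    for (int ox = 0; ox < outW; ++ox)
                        cols[((ky * K + kx) * outH + oy) * outW + ox] =
                            img[(oy + ky) * W + (ox + kx)];
        return cols;
    }

Multiplying a (numFilters x K*K) weight matrix by this patch matrix yields every convolution output in one GEMM, which maps directly onto the systolic array.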
Degree: Doctor of Philosophy
Subjects: Machine learning; Field programmable gate arrays
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/263209

 

DC Field / Value
dc.contributor.advisor: So, HKH
dc.contributor.advisor: Lam, EYM
dc.contributor.author: Ho, Man-ho
dc.contributor.author: 何文灝
dc.date.accessioned: 2018-10-16T07:35:01Z
dc.date.available: 2018-10-16T07:35:01Z
dc.date.issued: 2018
dc.identifier.citation: Ho, M. [何文灝]. (2018). Accelerating machine learning applications with FPGAs. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/263209
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Field programmable gate arrays
dc.title: Accelerating machine learning applications with FPGAs
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Electrical and Electronic Engineering
dc.description.nature: published_or_final_version
dc.identifier.doi: 10.5353/th_991044046591103414
dc.date.hkucongregation: 2018
dc.identifier.mmsid: 991044046591103414
