Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures

Han, Guodong; 韩国栋

DOI: 10.5353/th_b5053425

Appears in Collections:
- Computer Science & Information Systems: Theses
- HKU Theses Online

postgraduate thesis: Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures

Title	Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures
Authors	Han, Guodong 韩国栋
Advisors	Wang, CL
Issue Date	2013
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Han, G. [韩国栋]. (2013). Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053425
Abstract	GPU-based heterogeneous architectures (e.g., Tianhe-1A, Nebulae), which combine multi-core CPUs and GPUs, have seen increasing adoption and are becoming the norm in supercomputing because they are cost-effective and power-efficient. However, programming such architectures still requires significant effort from application developers, who must use sophisticated GPU programming languages such as CUDA and OpenCL. Although some automatic parallelization tools based on static analysis can ease the programming effort, this approach can only parallelize loops that are 100% free of inter-iteration dependencies (i.e., determined DO-ALL loops) because static analysis is imprecise. To exploit the abundant runtime parallelism and take full advantage of the computing resources of both CPU and GPU, we propose a new user-friendly compiler framework and runtime system that helps Java applications harness the full power of a heterogeneous system. It presents an all-round system design that unifies the programming style and language for transparent use of both CPUs and GPUs, automatically parallelizes all kinds of loops, and schedules workloads efficiently across CPU and GPU resources while ensuring data coherence during highly threaded execution. With simple user annotations, sequential Java source code is analyzed, translated, and compiled into a dual executable consisting of CUDA kernels and multiple Java threads, which run on GPU and CPU cores respectively. Annotated loops are automatically split into loop chunks (or tasks) that are scheduled to execute on all available GPU and CPU cores. To guide runtime task scheduling, we develop a novel dynamic loop profiler that generates the program dependency graph (PDG) and computes the density of dependencies across iterations through a hybrid checking scheme combining intra-warp and inter-warp analyses. By implementing a GPU-tailored thread-level speculation (TLS) model, our system supports speculative execution of loops with moderate dependency densities and privatization of loops having only false dependencies on the GPU side. Our scheduler also supports task stealing and task sharing algorithms that allow swift load redistribution across GPU and CPU. We have carried out several experiments to evaluate the profiling overhead and used up to 11 real-life applications to evaluate system performance. The results show that the overhead is moderate compared with sequential execution and indicate that almost all of the applications benefit from our system.
Degree	Master of Philosophy
Subject	Graphics processing units. Parallel processing (Electronic computers). Computer architecture.
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/188306
HKU Library Item ID	b5053425
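
The abstract mentions a profiler that computes the density of dependencies across loop iterations but does not define the metric. The following is a minimal sketch under one possible reading: density is taken to be the fraction of iterations that read a location written by an earlier iteration (a true, read-after-write dependency). The class name, the method name, and the representation of each iteration as arrays of read and written addresses are all hypothetical, and the sketch runs sequentially on the CPU rather than using the intra-warp and inter-warp checks of the thesis.

```java
import java.util.HashSet;
import java.util.Set;

public class DependencyDensity {
    /**
     * Hypothetical metric: fraction of iterations that read an address
     * written by some earlier iteration (cross-iteration RAW dependency).
     * reads[i] and writes[i] list the addresses touched by iteration i.
     */
    public static double trueDependencyDensity(int[][] reads, int[][] writes) {
        int n = reads.length;
        if (n == 0) {
            return 0.0;
        }
        Set<Integer> writtenByEarlier = new HashSet<>();
        int dependentIterations = 0;
        for (int i = 0; i < n; i++) {
            // Check reads before recording this iteration's own writes,
            // so a read and write inside one iteration is not counted.
            for (int addr : reads[i]) {
                if (writtenByEarlier.contains(addr)) {
                    dependentIterations++;
                    break;
                }
            }
            for (int addr : writes[i]) {
                writtenByEarlier.add(addr);
            }
        }
        return (double) dependentIterations / n;
    }

    public static void main(String[] args) {
        // a[i] = a[i] + 1: each iteration touches only its own element.
        int[][] independent = {{0}, {1}, {2}, {3}};
        System.out.println(trueDependencyDensity(independent, independent));

        // a[i + 1] = a[i] + 1: each iteration after the first reads
        // the element written by its predecessor.
        int[][] chainReads = {{0}, {1}, {2}, {3}};
        int[][] chainWrites = {{1}, {2}, {3}, {4}};
        System.out.println(trueDependencyDensity(chainReads, chainWrites));
    }
}
```

Under this reading, a density of zero corresponds to a DO-ALL loop, while a moderate density would mark a candidate for speculative execution; the thresholds the thesis uses are not stated in the abstract.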

DC Field	Value	Language
dc.contributor.advisor	Wang, CL	-
dc.contributor.author	Han, Guodong	-
dc.contributor.author	韩国栋	-
dc.date.accessioned	2013-08-27T08:03:33Z	-
dc.date.available	2013-08-27T08:03:33Z	-
dc.date.issued	2013	-
dc.identifier.citation	Han, G. [韩国栋]. (2013). Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053425	-
dc.identifier.uri	http://hdl.handle.net/10722/188306	-
dc.description.abstract	GPU-based heterogeneous architectures (e.g., Tianhe-1A, Nebulae), which combine multi-core CPUs and GPUs, have seen increasing adoption and are becoming the norm in supercomputing because they are cost-effective and power-efficient. However, programming such architectures still requires significant effort from application developers, who must use sophisticated GPU programming languages such as CUDA and OpenCL. Although some automatic parallelization tools based on static analysis can ease the programming effort, this approach can only parallelize loops that are 100% free of inter-iteration dependencies (i.e., determined DO-ALL loops) because static analysis is imprecise. To exploit the abundant runtime parallelism and take full advantage of the computing resources of both CPU and GPU, we propose a new user-friendly compiler framework and runtime system that helps Java applications harness the full power of a heterogeneous system. It presents an all-round system design that unifies the programming style and language for transparent use of both CPUs and GPUs, automatically parallelizes all kinds of loops, and schedules workloads efficiently across CPU and GPU resources while ensuring data coherence during highly threaded execution. With simple user annotations, sequential Java source code is analyzed, translated, and compiled into a dual executable consisting of CUDA kernels and multiple Java threads, which run on GPU and CPU cores respectively. Annotated loops are automatically split into loop chunks (or tasks) that are scheduled to execute on all available GPU and CPU cores. To guide runtime task scheduling, we develop a novel dynamic loop profiler that generates the program dependency graph (PDG) and computes the density of dependencies across iterations through a hybrid checking scheme combining intra-warp and inter-warp analyses. By implementing a GPU-tailored thread-level speculation (TLS) model, our system supports speculative execution of loops with moderate dependency densities and privatization of loops having only false dependencies on the GPU side. Our scheduler also supports task stealing and task sharing algorithms that allow swift load redistribution across GPU and CPU. We have carried out several experiments to evaluate the profiling overhead and used up to 11 real-life applications to evaluate system performance. The results show that the overhead is moderate compared with sequential execution and indicate that almost all of the applications benefit from our system.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.source.uri	http://hub.hku.hk/bib/B50534257	-
dc.subject.lcsh	Graphics processing units.	-
dc.subject.lcsh	Parallel processing (Electronic computers)	-
dc.subject.lcsh	Computer architecture.	-
dc.title	Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5053425	-
dc.description.thesisname	Master of Philosophy	-
dc.description.thesislevel	Master	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5053425	-
dc.date.hkucongregation	2013	-
dc.identifier.mmsid	991035481699703414	-
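
The abstract describes splitting annotated loops into chunks (tasks) and redistributing load through task stealing, without giving the scheduler's details. The sketch below illustrates the general idea only, under stated assumptions: both workers are plain Java threads (no CUDA kernel is launched), each worker owns a deque of chunks, and an idle worker steals from the tail of the other worker's deque. The class `ChunkStealing`, its methods, and the uneven initial distribution are hypothetical and chosen so that stealing can occur.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicLong;

public class ChunkStealing {
    /** Sums i*i for i in [0, n) by splitting the loop into chunks of chunkSize. */
    public static long sumOfSquares(int n, int chunkSize) {
        ConcurrentLinkedDeque<int[]> q0 = new ConcurrentLinkedDeque<>();
        ConcurrentLinkedDeque<int[]> q1 = new ConcurrentLinkedDeque<>();
        // Split [0, n) into half-open chunks {start, end}. The split is
        // deliberately uneven (q0 gets about three quarters of the chunks)
        // so that the lightly loaded worker has something to steal.
        int index = 0;
        for (int start = 0; start < n; start += chunkSize, index++) {
            int[] chunk = {start, Math.min(start + chunkSize, n)};
            (index % 4 == 0 ? q1 : q0).add(chunk);
        }
        AtomicLong total = new AtomicLong();
        Thread t0 = new Thread(() -> work(q0, q1, total));
        Thread t1 = new Thread(() -> work(q1, q0, total));
        t0.start();
        t1.start();
        try {
            t0.join();
            t1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return total.get();
    }

    /** Runs chunks from the worker's own deque, stealing from the other when empty. */
    private static void work(ConcurrentLinkedDeque<int[]> own,
                             ConcurrentLinkedDeque<int[]> other,
                             AtomicLong total) {
        while (true) {
            int[] chunk = own.pollFirst();
            if (chunk == null) {
                chunk = other.pollLast(); // steal from the opposite end
            }
            if (chunk == null) {
                return; // no chunks are created after start, so both deques are drained
            }
            long local = 0;
            for (int i = chunk[0]; i < chunk[1]; i++) {
                local += (long) i * i;
            }
            total.addAndGet(local);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000, 64));
    }
}
```

Because every chunk is removed from a deque exactly once and the loop body here has no cross-iteration dependency, the result does not depend on which worker runs which chunk. A real CPU/GPU co-scheduler would also have to account for data transfer and for the very different chunk sizes that suit each device, which this sketch leaves out.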
