File Download
  Links for fulltext
     (May Require Subscription)
Supplementary

postgraduate thesis: Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures

TitleProfile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures
Authors
Advisors
Advisor(s):Wang, CL
Issue Date2013
PublisherThe University of Hong Kong (Pokfulam, Hong Kong)
Citation
Han, G. [韩国栋]. (2013). Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053425
AbstractThe GPU-based heterogeneous architectures (e.g., Tianhe-1A, Nebulae), composing multi-core CPU and GPU, have drawn increasing adoptions and are becoming the norm of supercomputing as they are cost-effective and power-efficient. However, programming such heterogeneous architectures still requires significant effort from application developers using sophisticated GPU programming languages such as CUDA and OpenCL. Although some automatic parallelization tools utilizing static analysis could ease the programming efforts, this approach could only parallelize loops 100% free of inter-iteration dependency (i.e., determined DO-ALL loops) because of imprecision of static analysis. To exploit the abundant runtime parallelism and take full advantage of the computing resources both in CPU and GPU, in this work, we propose a new user-friendly compiler framework and runtime system, which helps Java applications harness the full power of a heterogeneous system. It unveils an all-round system design unifying the programming style and language for transparent use of both CPUs and GPUs, automatically parallelizing all kinds of loops, scheduling workloads efficiently across CPU and GPU resources while ensuring data coherence during highly-threaded execution. By means of simple user annotations, sequential Java source code will be analyzed, translated and compiled into a dual executable consisting of CUDA kernels and multiple Java threads running on GPU and CPU cores respectively. Annotated loops will be automatically split into loop chunks (or tasks) being scheduled to execute on all available GPU/CPU cores. To guide the runtime task scheduling, we develop a novel dynamic loop profiler which generates the program dependency graph (PDG) and computes the density of dependencies across iterations through a hybrid checking scheme combining intra-warp and inter-warp analyses. Implementing a GPU-tailored thread-level speculation (TLS) model, our system supports speculative execution of loops with moderate dependency densities and privatization of loops having only false dependencies on the GPU side. Our scheduler also supports task stealing and task sharing algorithms that allow swift load redistribution across GPU and CPU. We have carried out several experiments to evaluate the profiling overhead and up to 11 real-life applications to evaluate our system performance. Testing results show that the overhead is moderate compared with the sequential execution and prove that almost all the applications could benefit from our system.
DegreeMaster of Philosophy
SubjectGraphics processing units.
Parallel processing (Electronic computers)
Computer architecture.
Dept/ProgramComputer Science
Persistent Identifierhttp://hdl.handle.net/10722/188306

 

DC FieldValueLanguage
dc.contributor.advisorWang, CL-
dc.contributor.authorHan, Guodong-
dc.contributor.author韩国栋-
dc.date.accessioned2013-08-27T08:03:33Z-
dc.date.available2013-08-27T08:03:33Z-
dc.date.issued2013-
dc.identifier.citationHan, G. [韩国栋]. (2013). Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5053425-
dc.identifier.urihttp://hdl.handle.net/10722/188306-
dc.description.abstractThe GPU-based heterogeneous architectures (e.g., Tianhe-1A, Nebulae), composing multi-core CPU and GPU, have drawn increasing adoptions and are becoming the norm of supercomputing as they are cost-effective and power-efficient. However, programming such heterogeneous architectures still requires significant effort from application developers using sophisticated GPU programming languages such as CUDA and OpenCL. Although some automatic parallelization tools utilizing static analysis could ease the programming efforts, this approach could only parallelize loops 100% free of inter-iteration dependency (i.e., determined DO-ALL loops) because of imprecision of static analysis. To exploit the abundant runtime parallelism and take full advantage of the computing resources both in CPU and GPU, in this work, we propose a new user-friendly compiler framework and runtime system, which helps Java applications harness the full power of a heterogeneous system. It unveils an all-round system design unifying the programming style and language for transparent use of both CPUs and GPUs, automatically parallelizing all kinds of loops, scheduling workloads efficiently across CPU and GPU resources while ensuring data coherence during highly-threaded execution. By means of simple user annotations, sequential Java source code will be analyzed, translated and compiled into a dual executable consisting of CUDA kernels and multiple Java threads running on GPU and CPU cores respectively. Annotated loops will be automatically split into loop chunks (or tasks) being scheduled to execute on all available GPU/CPU cores. To guide the runtime task scheduling, we develop a novel dynamic loop profiler which generates the program dependency graph (PDG) and computes the density of dependencies across iterations through a hybrid checking scheme combining intra-warp and inter-warp analyses. Implementing a GPU-tailored thread-level speculation (TLS) model, our system supports speculative execution of loops with moderate dependency densities and privatization of loops having only false dependencies on the GPU side. Our scheduler also supports task stealing and task sharing algorithms that allow swift load redistribution across GPU and CPU. We have carried out several experiments to evaluate the profiling overhead and up to 11 real-life applications to evaluate our system performance. Testing results show that the overhead is moderate compared with the sequential execution and prove that almost all the applications could benefit from our system.-
dc.languageeng-
dc.publisherThe University of Hong Kong (Pokfulam, Hong Kong)-
dc.relation.ispartofHKU Theses Online (HKUTO)-
dc.rightsThe author retains all proprietary rights, (such as patent rights) and the right to use in future works.-
dc.rightsCreative Commons: Attribution 3.0 Hong Kong License-
dc.source.urihttp://hub.hku.hk/bib/B50534257-
dc.subject.lcshGraphics processing units.-
dc.subject.lcshParallel processing (Electronic computers)-
dc.subject.lcshComputer architecture.-
dc.titleProfile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures-
dc.typePG_Thesis-
dc.identifier.hkulb5053425-
dc.description.thesisnameMaster of Philosophy-
dc.description.thesislevelMaster-
dc.description.thesisdisciplineComputer Science-
dc.description.naturepublished_or_final_version-
dc.identifier.doi10.5353/th_b5053425-
dc.date.hkucongregation2013-

Export via OAI-PMH Interface in XML Formats


OR


Export to Other Non-XML Formats