Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms

Zhang, Chenggang; 张呈刚

File Download

FullText.pdf

Links for fulltext

(May Require Subscription)

DOI: 10.5353/th_b4716765

Supplementary

Citations:
Appears in Collections:
- Computer Science & Information Systems: Theses
- HKU Theses Online

postgraduate thesis: Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms

Title	Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms
Authors	Zhang, Chenggang 张呈刚
Issue Date	2011
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Zhang, C. [张呈刚]. (2011). Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b4716765
Abstract	General-Purpose computing on Graphics Processing Units (GPGPU) has attracted a lot of attention recently. Exciting results have been reported in using GPUs to accelerate applications in various domains such as scientific simulations, data mining, bio-informatics and computational finance. However, up to now GPUs can only accelerate data-parallel loops with statically analyzable parallelism. Loops with dynamic parallelism (e.g., with array accesses through subscripted subscripts), an important pattern in many general-purpose applications, cannot be parallelized on GPUs using existing technologies. Run-time loop parallelization using Thread Level Speculation (TLS) has been proposed in the literatures to parallelize loops with statically un-analyzable dependencies. However, most of the existing TLS systems are designed for multiprocessor/multi-core CPUs. GPUs have fundamental differences with CPUs in both hardware architecture and execution model, making the previous TLS designs not work or inefficient when ported to GPUs. This thesis presents GPUTLS, a runtime system designed to support speculative loop parallelization on GPUs. The design of GPU-TLS addresses several key problems encountered when adapting TLS to GPUs: (1) To reduce the possibility of mis-speculation, deferred-update memory versioning scheme is adopted to avoid mis-speculations caused by inter-iteration WAR and WAW dependencies. A technique named intra-warp value forwarding is proposed to respect some inter-iteration RAW dependencies, which further reduces the mis-speculation possibility. (2) An incremental speculative execution scheme is designed to exploit partial parallelism within loops. This avoids excessive re-executions and reduces the mis-speculation penalty. (3) The dependency checking among thousands of speculative GPU threads poses large overhead and can easily become the performance bottleneck. To lower the overhead, we design several e_cient dependency checking schemes named PRW+BDC, SW, SR, SRW+EDC, and SRW+LDC respectively. (4) We devise a novel parallel commit scheme to avoid the overhead incurred by the serial commit phase in most existing TLS designs. We have carried out extensive experiments on two platforms with different NVIDIA GPUs, using both a synthetic loop that can simulate loops with different characteristics and several loops from real-life applications. Testing results show that the proposed intra-warp value forwarding and eager dependency checking techniques can improve the performance for almost all kinds of loop patterns. We observe that compared with other dependency checking schemes, SR and SW can achieve better performance in most cases. It is also shown that the proposed parallel commit scheme is especially useful for loops with large write set size and small number of inter-iteration WAW dependencies. Overall, GPU-TLS can achieve speedups ranging from 5 to 105 for loops with dynamic parallelism.
Degree	Master of Philosophy
Subject	Graphics processing units. Parallel processing (Electronic computers) Threads (Computer programs)
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/180014
HKU Library Item ID	b4716765

DC Field	Value	Language
dc.contributor.author	Zhang, Chenggang	-
dc.contributor.author	张呈刚	-
dc.date.issued	2011	-
dc.identifier.citation	Zhang, C. [张呈刚]. (2011). Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b4716765	-
dc.identifier.uri	http://hdl.handle.net/10722/180014	-
dc.description.abstract	General-Purpose computing on Graphics Processing Units (GPGPU) has attracted a lot of attention recently. Exciting results have been reported in using GPUs to accelerate applications in various domains such as scientific simulations, data mining, bio-informatics and computational finance. However, up to now GPUs can only accelerate data-parallel loops with statically analyzable parallelism. Loops with dynamic parallelism (e.g., with array accesses through subscripted subscripts), an important pattern in many general-purpose applications, cannot be parallelized on GPUs using existing technologies. Run-time loop parallelization using Thread Level Speculation (TLS) has been proposed in the literatures to parallelize loops with statically un-analyzable dependencies. However, most of the existing TLS systems are designed for multiprocessor/multi-core CPUs. GPUs have fundamental differences with CPUs in both hardware architecture and execution model, making the previous TLS designs not work or inefficient when ported to GPUs. This thesis presents GPUTLS, a runtime system designed to support speculative loop parallelization on GPUs. The design of GPU-TLS addresses several key problems encountered when adapting TLS to GPUs: (1) To reduce the possibility of mis-speculation, deferred-update memory versioning scheme is adopted to avoid mis-speculations caused by inter-iteration WAR and WAW dependencies. A technique named intra-warp value forwarding is proposed to respect some inter-iteration RAW dependencies, which further reduces the mis-speculation possibility. (2) An incremental speculative execution scheme is designed to exploit partial parallelism within loops. This avoids excessive re-executions and reduces the mis-speculation penalty. (3) The dependency checking among thousands of speculative GPU threads poses large overhead and can easily become the performance bottleneck. To lower the overhead, we design several e_cient dependency checking schemes named PRW+BDC, SW, SR, SRW+EDC, and SRW+LDC respectively. (4) We devise a novel parallel commit scheme to avoid the overhead incurred by the serial commit phase in most existing TLS designs. We have carried out extensive experiments on two platforms with different NVIDIA GPUs, using both a synthetic loop that can simulate loops with different characteristics and several loops from real-life applications. Testing results show that the proposed intra-warp value forwarding and eager dependency checking techniques can improve the performance for almost all kinds of loop patterns. We observe that compared with other dependency checking schemes, SR and SW can achieve better performance in most cases. It is also shown that the proposed parallel commit scheme is especially useful for loops with large write set size and small number of inter-iteration WAW dependencies. Overall, GPU-TLS can achieve speedups ranging from 5 to 105 for loops with dynamic parallelism.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.source.uri	http://hub.hku.hk/bib/B47167658	-
dc.subject.lcsh	Graphics processing units.	-
dc.subject.lcsh	Parallel processing (Electronic computers)	-
dc.subject.lcsh	Threads (Computer programs)	-
dc.title	Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b4716765	-
dc.description.thesisname	Master of Philosophy	-
dc.description.thesislevel	Master	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b4716765	-
dc.date.hkucongregation	2012	-
dc.identifier.mmsid	991032832679703414	-

File Download

Links for fulltext

(May Require Subscription)

Supplementary

postgraduate thesis: Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms

Export via OAI-PMH Interface in XML Formats

OR

Export to Other Non-XML Formats