
Postgraduate thesis: On-GPU thread-data remapping for reducing control flow divergence

Title: On-GPU thread-data remapping for reducing control flow divergence
Authors: Lin, Huanxin (林煥鑫)
Advisor(s): Wang, CL
Issue Date: 2019
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Lin, H. [林煥鑫]. (2019). On-GPU thread-data remapping for reducing control flow divergence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract:
The Graphics Processing Unit (GPU) plays an increasingly vital role in computation. Thanks to the Single Instruction Multiple Data (SIMD) execution model, GPUs provide massive computing power at low energy cost. It remains difficult to apply GPU computation to general-purpose workloads, however, and control flow divergence is a major obstacle. GPU cores are grouped into SIMD units that share the same control circuit and therefore execute every instruction in lockstep. When a conditional statement is encountered, the SIMD unit must execute each divergent control path sequentially, resulting in unnecessary computation. Based on the syntax of the conditional statement, control flow divergence can be divided into branch divergence and loop divergence, both commonly found in applications with decision trees.

Thread-data remapping (TDR) is the most widely used software solution effective against both types of divergence: threads in each SIMD unit are remapped to data that lead to the same control path. So far, however, TDR has only been performed as compile-time preprocessing on the host end. It evaluates conditional statements in advance and reorganizes input data accordingly, but this often entails redundant computation of the runtime results that the conditions depend on. As a compile-time solution, traditional TDR can change the thread-data mapping only once per computation. Multiple instances of divergence in a kernel must therefore compromise on a single mapping, so traditional TDR may miss speedup opportunities because of conflicting needs among divergence targets. Moreover, previous research focused mostly on branch divergence, yet severe loop divergence has recently been found in GPU-accelerated routing, where highly variant network traffic requires traditional TDR treatment, with its expensive data reorganization, on every run.

This work presents On-GPU TDR, a GPU-runtime software solution. Given the characteristics of GPU architecture, TDR approaches are redesigned to be fully parallel and decentralized across GPU threads, without costly data movement. Each divergence target is treated separately to avoid mapping conflicts and thus achieve full divergence reduction. Because nested branches degrade performance exponentially, On-GPU TDR adopts a recursion scheme that reduces redundant computation when treating inner branches, and it features an inter-thread synchronization protocol that works for an arbitrary number of threads on the same branch path. Loop divergence is also specifically addressed: native GPU scheduling runs uninterrupted until a reduction opportunity is detected, and cross-iteration TDR is then performed to minimize overhead.

On-GPU TDR has proved effective on state-of-the-art GPUs from both NVIDIA and AMD. Our designs successfully reduce the overheads caused by global memory access, idle waiting, and similar factors. For benchmarks with branch divergence, the highest speedup exceeds 4x on all GPU models. Loop divergence in packet processing is also reduced more completely with On-GPU TDR, which keeps GPU computation from becoming the bottleneck and sustains processing throughput close to the maximum transfer bandwidth. Independence from host-end preprocessing preserves direct device-to-device data streaming, cutting processing latency by at least half compared with other works. On-GPU TDR reduces the burden on the host machine and improves both computation efficiency and transfer flexibility.
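To make the problem and the traditional baseline concrete, here is a minimal CUDA sketch (an editorial illustration, not code from the thesis): a kernel whose warp lanes diverge on a data-dependent branch, and a remapped variant in the spirit of traditional host-side TDR. The kernel names, the predicate, and the remap permutation are all assumptions made for the example.

#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Divergent baseline: lanes in the same warp take different paths
// depending on their data, so the warp serializes both paths.
__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f) {                  // data-dependent branch
        float x = in[i];                 // "heavy" path
        for (int k = 0; k < 64; ++k) x = x * 0.999f + 0.001f;
        out[i] = x;
    } else {
        out[i] = 0.0f;                   // "light" path
    }
}

// Remapped variant: remap is a permutation of [0, n) with all
// predicate-true indices first, so consecutive threads process data
// that take the same path and most warps stay convergent.
__global__ void remapped(const float* in, float* out,
                         const int* remap, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    int i = remap[t];                    // thread t handles element remap[t]
    if (in[i] > 0.0f) {
        float x = in[i];
        for (int k = 0; k < 64; ++k) x = x * 0.999f + 0.001f;
        out[i] = x;
    } else {
        out[i] = 0.0f;
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i)          // worst case: paths alternate within every warp
        h[i] = (i % 2 == 0) ? 1.0f : -1.0f;

    // Host-side remapping, as traditional compile-time TDR would do
    // before launch: partition indices by the branch predicate.
    std::vector<int> remap(n);
    for (int i = 0; i < n; ++i) remap[i] = i;
    std::stable_partition(remap.begin(), remap.end(),
                          [&](int i) { return h[i] > 0.0f; });

    float *d_in, *d_out;
    int *d_remap;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_remap, n * sizeof(int));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_remap, remap.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int block = 256, grid = (n + block - 1) / block;
    divergent<<<grid, block>>>(d_in, d_out, n);           // every warp runs both paths
    remapped<<<grid, block>>>(d_in, d_out, d_remap, n);   // most warps stay convergent
    cudaDeviceSynchronize();

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_remap);
    return 0;
}

In the remapped launch, consecutive threads work on elements that take the same path, so most warps stay convergent. The cost is the host-side partition before every launch, plus some loss of memory coalescing from the indirection; these are among the overheads the abstract says On-GPU TDR targets by regrouping threads on the device at runtime, per divergence target.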
Degree: Doctor of Philosophy
Subjects: Graphics processing units; Computer architecture
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/279761

DC Field: Value
dc.contributor.advisor: Wang, CL
dc.contributor.author: Lin, Huanxin
dc.contributor.author: 林煥鑫
dc.date.accessioned: 2019-12-10T10:04:47Z
dc.date.available: 2019-12-10T10:04:47Z
dc.date.issued: 2019
dc.identifier.citation: Lin, H. [林煥鑫]. (2019). On-GPU thread-data remapping for reducing control flow divergence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/279761
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Graphics processing units
dc.subject.lcsh: Computer architecture
dc.title: On-GPU thread-data remapping for reducing control flow divergence
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.identifier.doi: 10.5353/th_991044168860103414
dc.date.hkucongregation: 2019
dc.identifier.mmsid: 991044168860103414
