
Postgraduate thesis: On-GPU thread-data remapping for reducing control flow divergence

Title: On-GPU thread-data remapping for reducing control flow divergence
Authors: Lin, Huanxin (林煥鑫)
Advisor(s): Wang, CL
Issue Date: 2019
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Lin, H. [林煥鑫]. (2019). On-GPU thread-data remapping for reducing control flow divergence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract:
The Graphics Processing Unit (GPU) plays an increasingly vital role in computation. Thanks to the Single Instruction Multiple Data (SIMD) execution model, GPUs provide massive computing power at low energy cost. It remains difficult to apply GPU computation to general-purpose workloads, however, and control flow divergence is a major obstacle. GPU cores are grouped into SIMD units that share the same control circuit and therefore execute every instruction in lockstep. When a conditional statement is encountered, the SIMD unit must execute each divergent control path sequentially, resulting in unnecessary computation. Based on the syntax of the conditional statement, control flow divergence can be divided into branch divergence and loop divergence, both commonly found in applications with decision trees.

Thread-data remapping (TDR) is the most widely used software solution effective against both types of divergence: threads in each SIMD unit are remapped to data that lead to the same control path. So far, however, TDR has only been performed as compile-time preprocessing on the host end. It evaluates conditional statements in advance and reorganizes input data accordingly, but this often entails redundant computation of the runtime results that the conditions depend on. As a compile-time solution, traditional TDR can change the thread-data mapping only once per computation. Multiple instances of divergence in a kernel must therefore compromise on a single mapping, so traditional TDR may miss speedup opportunities because of conflicting needs among divergence targets. Moreover, previous research focused mostly on branch divergence, yet severe loop divergence has recently been found in GPU-accelerated routing, where highly variant network traffic requires traditional TDR treatment, with its expensive data reorganization, on every run.

This work presents On-GPU TDR, a GPU-runtime software solution. Given the characteristics of GPU architecture, TDR approaches are redesigned to be fully parallel and decentralized across GPU threads, without costly data movement. Each divergence target is treated separately to avoid mapping conflicts and thus achieve full divergence reduction. Because nested branches degrade performance exponentially, On-GPU TDR adopts a recursion scheme that reduces redundant computation when treating inner branches, and it features an inter-thread synchronization protocol that works for an arbitrary number of threads on the same branch path. Loop divergence is also specifically addressed: native GPU scheduling runs uninterrupted until a reduction opportunity is detected, and cross-iteration TDR is then performed to minimize overhead.

On-GPU TDR has proved effective on state-of-the-art GPUs from both NVIDIA and AMD. Our designs successfully reduce the overheads caused by global memory access, idle waiting, and similar factors. For benchmarks with branch divergence, the highest speedup exceeds 4x on all GPU models. Loop divergence in packet processing is also reduced more completely with On-GPU TDR, which keeps GPU computation from becoming the bottleneck and sustains processing throughput close to the maximum transfer bandwidth. Independence from host-end preprocessing preserves direct device-to-device data streaming, cutting processing latency by at least half compared with other works. On-GPU TDR reduces the burden on the host machine and improves both computation efficiency and transfer flexibility.
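To make the problem and the traditional baseline concrete, here is a minimal CUDA sketch (an editorial illustration, not code from the thesis): a kernel whose warp lanes diverge on a data-dependent branch, and a remapped variant in the spirit of traditional host-side TDR. The kernel names, the predicate, and the remap permutation are all assumptions made for the example.

#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Divergent baseline: lanes in the same warp take different paths
// depending on their data, so the warp serializes both paths.
__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f) {                  // data-dependent branch
        float x = in[i];                 // "heavy" path
        for (int k = 0; k < 64; ++k) x = x * 0.999f + 0.001f;
        out[i] = x;
    } else {
        out[i] = 0.0f;                   // "light" path
    }
}

// Remapped variant: remap is a permutation of [0, n) with all
// predicate-true indices first, so consecutive threads process data
// that take the same path and most warps stay convergent.
__global__ void remapped(const float* in, float* out,
                         const int* remap, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    int i = remap[t];                    // thread t handles element remap[t]
    if (in[i] > 0.0f) {
        float x = in[i];
        for (int k = 0; k < 64; ++k) x = x * 0.999f + 0.001f;
        out[i] = x;
    } else {
        out[i] = 0.0f;
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i)          // worst case: paths alternate within every warp
        h[i] = (i % 2 == 0) ? 1.0f : -1.0f;

    // Host-side remapping, as traditional compile-time TDR would do
    // before launch: partition indices by the branch predicate.
    std::vector<int> remap(n);
    for (int i = 0; i < n; ++i) remap[i] = i;
    std::stable_partition(remap.begin(), remap.end(),
                          [&](int i) { return h[i] > 0.0f; });

    float *d_in, *d_out;
    int *d_remap;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_remap, n * sizeof(int));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_remap, remap.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int block = 256, grid = (n + block - 1) / block;
    divergent<<<grid, block>>>(d_in, d_out, n);           // every warp runs both paths
    remapped<<<grid, block>>>(d_in, d_out, d_remap, n);   // most warps stay convergent
    cudaDeviceSynchronize();

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_remap);
    return 0;
}

In the remapped launch, consecutive threads work on elements that take the same path, so most warps stay convergent. The cost is the host-side partition before every launch, plus some loss of memory coalescing from the indirection; these are among the overheads the abstract says On-GPU TDR targets by regrouping threads on the device at runtime, per divergence target.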
Degree: Doctor of Philosophy
Subjects: Graphics processing units; Computer architecture
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/279761

DC Field: Value
dc.contributor.advisor: Wang, CL
dc.contributor.author: Lin, Huanxin
dc.contributor.author: 林煥鑫
dc.date.accessioned: 2019-12-10T10:04:47Z
dc.date.available: 2019-12-10T10:04:47Z
dc.date.issued: 2019
dc.identifier.citation: Lin, H. [林煥鑫]. (2019). On-GPU thread-data remapping for reducing control flow divergence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/279761
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Graphics processing units
dc.subject.lcsh: Computer architecture
dc.title: On-GPU thread-data remapping for reducing control flow divergence
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.identifier.doi: 10.5353/th_991044168860103414
dc.date.hkucongregation: 2019
dc.identifier.mmsid: 991044168860103414
