Conference Paper: GPU-TLS: an efficient runtime for speculative loop parallelization on GPUs

Title: GPU-TLS: an efficient runtime for speculative loop parallelization on GPUs
Authors: Zhang, C; Han, G; Wang, CL
Keywords: GPGPU; Speculative loop parallelization; Thread-level speculation (TLS); GPU-TLS
Issue Date: 2013
Publisher: IEEE Computer Society. The Journal's website is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000093
Citation: The 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2013), Delft, Netherlands, 13-16 May 2013. In Conference Proceedings, 2013, p. 120-127
Abstract: Recently, GPUs have risen as an important parallel platform for general-purpose applications, both in HPC and cloud environments. Due to their special execution model, developing programs for GPUs is difficult even with the recent introduction of high-level languages like CUDA and OpenCL. To ease the programming effort, some research has proposed automatically generating parallel GPU code with complex compile-time techniques. However, this approach can only parallelize loops that are 100% free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism, which cannot be proven by static analysis, in this work we propose GPU-TLS, a runtime system to speculatively parallelize possibly-parallel loops in sequential programs on GPUs. GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel, speculating that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations without mis-speculations are copied to the master memory, while iterations encountering mis-speculations are re-executed. GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The higher mis-speculation rate caused by the larger number of threads is reduced by three approaches: the loop-chopping parallelization approach, the deferred memory update scheme, and the intra-warp value forwarding method. (2) The higher overhead of dependency checking is reduced by a hybrid scheme: eager intra-warp dependency checking combined with lazy inter-warp dependency checking. (3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order while still guaranteeing sequential semantics. Extensive evaluations using both microbenchmarks and real-life applications on two recent NVIDIA GPU cards show that speculative loop parallelization using GPU-TLS can achieve speedups ranging from 5 to 160 for sequential programs with possibly-parallel loops. © 2013 IEEE. [An illustrative sketch of the speculative execution flow described here appears below the record fields.]
Persistent Identifier: http://hdl.handle.net/10722/189638
ISBN: 978-0-7695-4996-5
ISI Accession Number ID: WOS:000325006300019
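
The abstract above outlines a concrete execution flow: chop a possibly-parallel loop into sub-loops, run each sub-loop speculatively as a GPU kernel with writes buffered rather than applied to master memory, check for inter-iteration dependencies, then either commit the buffered writes or re-execute the mis-speculated iterations. The CUDA sketch below illustrates that flow under simplifying assumptions; it is not the authors' GPU-TLS code, and the names (spec_body, dep_check, CHOP) and the toy access pattern are hypothetical. The paper's intra-warp value forwarding, eager intra-warp checking, and parallel out-of-order commit are omitted for brevity.

// Minimal, hypothetical sketch of speculative loop chopping on a GPU.
// Each chop of CHOP iterations runs as one kernel; writes are buffered
// (deferred memory update) and only committed after a dependency check.
#include <cstdio>
#include <cuda_runtime.h>

#define N    1024   // total loop iterations
#define CHOP 256    // iterations per sub-loop (one speculative kernel launch)

// Speculative body: one thread per iteration. Reads go to master memory,
// writes go to a private buffer, and each read index is logged for checking.
__global__ void spec_body(const float* master, float* write_buf,
                          int* read_log, int base)
{
    int i = base + blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= base + CHOP || i >= N) return;
    int src = (i * 7) % N;                     // toy data-dependent read
    read_log[i - base] = src;                  // log the read address
    write_buf[i - base] = master[src] + 1.0f;  // buffered speculative write
}

// Lazy dependency check for one chop: iteration i mis-speculates if it read
// a location that an earlier iteration of the same chop writes (iteration j
// writes index base + j in this toy loop), because that write was buffered.
__global__ void dep_check(const int* read_log, int base, int* misspec)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= CHOP) return;
    int src = read_log[i];
    if (src >= base && src < base + i) atomicExch(misspec, 1);
}

int main()
{
    float *master, *write_buf; int *read_log, *misspec;
    cudaMallocManaged(&master, N * sizeof(float));
    cudaMallocManaged(&write_buf, CHOP * sizeof(float));
    cudaMallocManaged(&read_log, CHOP * sizeof(int));
    cudaMallocManaged(&misspec, sizeof(int));
    for (int i = 0; i < N; ++i) master[i] = (float)i;

    for (int base = 0; base < N; base += CHOP) {         // loop chopping
        *misspec = 0;
        spec_body<<<CHOP / 128, 128>>>(master, write_buf, read_log, base);
        dep_check<<<CHOP / 128, 128>>>(read_log, base, misspec);
        cudaDeviceSynchronize();
        if (*misspec) {
            // Mis-speculation: discard the buffer and re-execute this chop
            // sequentially so later iterations see earlier iterations' writes.
            for (int i = base; i < base + CHOP; ++i)
                master[i] = master[(i * 7) % N] + 1.0f;
        } else {
            // Commit: copy the chop's buffered writes into master memory.
            for (int j = 0; j < CHOP; ++j)
                master[base + j] = write_buf[j];
        }
    }
    printf("master[0]=%.1f master[N-1]=%.1f\n", master[0], master[N - 1]);
    cudaFree(master); cudaFree(write_buf); cudaFree(read_log); cudaFree(misspec);
    return 0;
}

In this sketch a mis-speculated chop simply falls back to sequential re-execution on the host; GPU-TLS itself, as described in the abstract, re-executes the offending iterations and lets iterations commit out of order while preserving sequential semantics.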

 

DC Field | Value | Language
dc.contributor.author | Zhang, C | en_US
dc.contributor.author | Han, G | en_US
dc.contributor.author | Wang, CL | en_US
dc.date.accessioned | 2013-09-17T14:50:33Z | -
dc.date.available | 2013-09-17T14:50:33Z | -
dc.date.issued | 2013 | en_US
dc.identifier.citation | The 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2013), Delft, Netherlands, 13-16 May 2013. In Conference Proceedings, 2013, p. 120-127 | en_US
dc.identifier.isbn | 978-0-7695-4996-5 | -
dc.identifier.uri | http://hdl.handle.net/10722/189638 | -
dc.description.abstract | Recently, GPUs have risen as an important parallel platform for general-purpose applications, both in HPC and cloud environments. Due to their special execution model, developing programs for GPUs is difficult even with the recent introduction of high-level languages like CUDA and OpenCL. To ease the programming effort, some research has proposed automatically generating parallel GPU code with complex compile-time techniques. However, this approach can only parallelize loops that are 100% free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism, which cannot be proven by static analysis, in this work we propose GPU-TLS, a runtime system to speculatively parallelize possibly-parallel loops in sequential programs on GPUs. GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel, speculating that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations without mis-speculations are copied to the master memory, while iterations encountering mis-speculations are re-executed. GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The higher mis-speculation rate caused by the larger number of threads is reduced by three approaches: the loop-chopping parallelization approach, the deferred memory update scheme, and the intra-warp value forwarding method. (2) The higher overhead of dependency checking is reduced by a hybrid scheme: eager intra-warp dependency checking combined with lazy inter-warp dependency checking. (3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order while still guaranteeing sequential semantics. Extensive evaluations using both microbenchmarks and real-life applications on two recent NVIDIA GPU cards show that speculative loop parallelization using GPU-TLS can achieve speedups ranging from 5 to 160 for sequential programs with possibly-parallel loops. © 2013 IEEE. | -
dc.language | eng | en_US
dc.publisher | IEEE Computer Society. The Journal's website is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000093 | -
dc.relation.ispartof | IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing Proceedings | en_US
dc.subject | GPGPU | -
dc.subject | Speculative loop parallelization | -
dc.subject | Thread-level speculation (TLS) | -
dc.subject | GPU-TLS | -
dc.title | GPU-TLS: an efficient runtime for speculative loop parallelization on GPUs | en_US
dc.type | Conference_Paper | en_US
dc.identifier.email | Zhang, C: cgzhang@cs.hku.hk | en_US
dc.identifier.email | Han, G: gdhan@cs.hku.hk | -
dc.identifier.email | Wang, CL: clwang@cs.hku.hk | -
dc.identifier.authority | Wang, CL=rp00183 | en_US
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1109/CCGrid.2013.34 | -
dc.identifier.scopus | eid_2-s2.0-84881269753 | -
dc.identifier.hkuros | 223371 | en_US
dc.identifier.spage | 120 | -
dc.identifier.epage | 127 | -
dc.identifier.isi | WOS:000325006300019 | -
dc.publisher.place | United States | en_US
dc.customcontrol.immutable | sml 131002 | -
