Conference Paper: Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism

Title: Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
Authors: Zhang, S; Diao, L; Wu, C; Wang, S; Lin, W
Keywords: Distributed system; Neural networks; Pipeline parallelism
Issue Date: 2022
Publisher: Association for Computing Machinery
Citation: The 13th ACM Symposium on Cloud Computing (SoCC '22), San Francisco, CA, United States, November 8-10, 2022. In SoCC '22: Proceedings of the 13th Symposium on Cloud Computing, p. 403-418
Abstract: Deep neural networks (DNNs) with trillions of parameters have emerged, e.g., Mixture-of-Experts (MoE) models. Training models of this scale requires sophisticated parallelization strategies such as the newly proposed SPMD parallelism, which shards each tensor along different dimensions. A common problem when using SPMD is that computation stalls during communication due to data dependencies, resulting in low GPU utilization and long training times. We present a general technique to accelerate SPMD-based DNN training by maximizing computation-communication overlap and automating SPMD strategy search. The key idea is to duplicate the DNN model into two copies that have no dependency on each other, and interleave their execution such that computation of one copy overlaps with communication of the other. We propose a dynamic programming algorithm to automatically identify optimized sharding strategies that minimize model training time by maximally enabling computation-communication overlap. Experiments show that our designs achieve up to 61% training speed-up compared to existing frameworks.
Persistent Identifier: http://hdl.handle.net/10722/320624
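
The computation-communication overlap described in the abstract can be sketched with asynchronous collectives: while the gradients of one model copy are in flight, the other copy keeps computing. The following is a minimal, illustrative PyTorch sketch under assumed simplifications (a single-process gloo group so it runs anywhere, a toy matmul standing in for the sharded computation, placeholder gradients); the helper names `forward_copy` and `train_step` are hypothetical, and this is not the authors' implementation.

```python
import os
import torch
import torch.distributed as dist

# Single-process setup so the sketch runs anywhere; real SPMD training would
# span many GPUs/hosts and typically use NCCL instead of gloo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

def forward_copy(x, w):
    # Stand-in for the (sharded) computation of one model copy.
    return x @ w

def train_step(batch_a, batch_b, w):
    # Copy A: compute, then launch its gradient communication asynchronously.
    out_a = forward_copy(batch_a, w)
    grad_a = torch.ones_like(out_a)  # placeholder gradient
    work_a = dist.all_reduce(grad_a, async_op=True)

    # Copy B: its computation overlaps with copy A's in-flight communication.
    out_b = forward_copy(batch_b, w)
    grad_b = torch.ones_like(out_b)
    work_b = dist.all_reduce(grad_b, async_op=True)

    # Wait for both communications to finish before applying updates.
    work_a.wait()
    work_b.wait()
    return grad_a, grad_b

if __name__ == "__main__":
    w = torch.randn(8, 8)
    train_step(torch.randn(4, 8), torch.randn(4, 8), w)
    dist.destroy_process_group()
```

In the paper's setting, the two interleaved copies are the duplicated DNN model with no mutual data dependency, and the traffic being hidden is the collective communication introduced by SPMD tensor sharding.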

 

DC Field: Value
dc.contributor.author: Zhang, S
dc.contributor.author: Diao, L
dc.contributor.author: Wu, C
dc.contributor.author: Wang, S
dc.contributor.author: Lin, W
dc.date.accessioned: 2022-10-21T07:56:50Z
dc.date.available: 2022-10-21T07:56:50Z
dc.date.issued: 2022
dc.identifier.citation: The 13th ACM Symposium on Cloud Computing (SoCC '22), San Francisco, CA, United States, November 8-10, 2022. In SoCC '22: Proceedings of the 13th Symposium on Cloud Computing, p. 403-418
dc.identifier.uri: http://hdl.handle.net/10722/320624
dc.description.abstract: Deep neural networks (DNNs) with trillions of parameters have emerged, e.g., Mixture-of-Experts (MoE) models. Training models of this scale requires sophisticated parallelization strategies such as the newly proposed SPMD parallelism, which shards each tensor along different dimensions. A common problem when using SPMD is that computation stalls during communication due to data dependencies, resulting in low GPU utilization and long training times. We present a general technique to accelerate SPMD-based DNN training by maximizing computation-communication overlap and automating SPMD strategy search. The key idea is to duplicate the DNN model into two copies that have no dependency on each other, and interleave their execution such that computation of one copy overlaps with communication of the other. We propose a dynamic programming algorithm to automatically identify optimized sharding strategies that minimize model training time by maximally enabling computation-communication overlap. Experiments show that our designs achieve up to 61% training speed-up compared to existing frameworks.
dc.language: eng
dc.publisher: Association for Computing Machinery
dc.relation.ispartof: SoCC '22: Proceedings of the 13th Symposium on Cloud Computing
dc.rights: SoCC '22: Proceedings of the 13th Symposium on Cloud Computing. Copyright © Association for Computing Machinery.
dc.subject: Distributed system
dc.subject: Neural networks
dc.subject: Pipeline parallelism
dc.title: Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
dc.type: Conference_Paper
dc.identifier.email: Wu, C: cwu@cs.hku.hk
dc.identifier.authority: Wu, C=rp01397
dc.identifier.doi: 10.1145/3542929.3563487
dc.identifier.hkuros: 340525
dc.identifier.spage: 403
dc.identifier.epage: 418
dc.publisher.place: United States
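
The sharding-strategy search mentioned in dc.description.abstract can likewise be illustrated with a toy dynamic program over per-layer sharding choices. Everything below is a hypothetical simplification for illustration: the two candidate shardings ("row" and "col"), the made-up per-layer cost numbers, the unit resharding cost, and modeling an overlapped layer's cost as max(compute, communication) are assumptions, not the paper's actual cost model or algorithm.

```python
# Hypothetical per-layer costs (milliseconds) for two candidate sharding
# strategies per layer; these numbers are made up for illustration only.
#   strategy -> (compute_time, communication_time)
LAYER_COSTS = [
    {"row": (4.0, 3.0), "col": (4.0, 1.5)},
    {"row": (6.0, 2.0), "col": (5.0, 4.0)},
    {"row": (3.0, 1.0), "col": (3.5, 0.5)},
]

def reshard_cost(prev, cur):
    # Hypothetical cost of converting between shardings of consecutive layers.
    return 0.0 if prev == cur else 1.0

def search(layer_costs):
    """Dynamic programming over per-layer sharding choices.

    With two interleaved model copies, a layer's effective cost is taken as
    max(compute, communication) rather than their sum, reflecting that one
    copy's communication can hide behind the other copy's computation.
    """
    # dp[s] = (best total time ending with strategy s, chosen strategy path)
    dp = {s: (max(c), [s]) for s, c in layer_costs[0].items()}
    for layer in layer_costs[1:]:
        new_dp = {}
        for s, (compute, comm) in layer.items():
            step = max(compute, comm)
            new_dp[s] = min(
                (t + reshard_cost(prev, s) + step, path + [s])
                for prev, (t, path) in dp.items()
            )
        dp = new_dp
    return min(dp.values())

if __name__ == "__main__":
    total, plan = search(LAYER_COSTS)
    print(f"estimated step time: {total:.1f} ms, sharding plan: {plan}")
```

The recurrence is the usual layer-by-layer form: the best time for layer i under strategy s is the best time for layer i-1 under some strategy s', plus the cost of resharding between the two, plus the overlapped cost of layer i.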
