Links for fulltext
(May Require Subscription)
- Publisher Website (DOI): 10.1109/INFOCOM42981.2021.9488678
- Scopus: eid_2-s2.0-85111944048
- Web of Science: WOS:000702210400015
Conference Paper: Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training
Field | Value |
---|---|
Title | Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training |
Authors | Zhang, Z; Wu, C; Li, Z |
Issue Date | 2021 |
Publisher | IEEE Computer Society. The conference proceedings are available at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359 |
Citation | IEEE International Conference on Computer Communications (INFOCOM), Virtual Conference, Vancouver, BC, Canada, 10-13 May 2021, p. 1-10 |
Abstract | Distributed machine learning with multiple concurrent workers has been widely adopted to train large deep neural networks (DNNs). Parameter synchronization is a key component in each iteration of distributed training, where workers exchange locally computed gradients through an AllReduce operation or parameter servers for global parameter updates. Parameter synchronization often constitutes a significant portion of the training time; minimizing the communication time contributes substantially to DNN training speed-up. Standard ring-based AllReduce and the parameter server (PS) architecture work efficiently mostly under homogeneous inter-worker connectivity. However, available bandwidth among workers in real-world clusters is often heterogeneous, due to different hardware configurations, switching topologies, and contention with concurrent jobs. This work investigates the best parameter synchronization topology and schedule among workers for the most expedited communication in distributed DNN training. We show that the optimal parameter synchronization topology should consist of trees rooted at different workers, each aggregating or broadcasting a partition of the gradients/parameters. We identify a near-optimal forest packing that maximally utilizes available bandwidth and overlaps the aggregation and broadcast stages to minimize communication time. We provide a theoretical analysis of the performance bound and show, through extensive evaluation under various settings, that our scheme outperforms state-of-the-art parameter synchronization schemes by up to 18.3 times. |
Persistent Identifier | http://hdl.handle.net/10722/301414 |
ISSN | 0743-166X (2023 SCImago Journal Rankings: 2.865) |
ISI Accession Number ID | WOS:000702210400015 |
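The synchronization pattern described in the abstract, in which each gradient partition is aggregated up a tree rooted at a different worker and then broadcast back down, can be sketched roughly as follows. This is an illustrative simplification, not the paper's forest-packing scheme: the one-level star trees, the round-robin partitioning, and the function names (`build_tree`, `synchronize`) are all assumptions made for exposition.

```python
def build_tree(root, workers):
    """A trivial one-level tree: every non-root worker is a direct child.

    The paper packs trees adapted to heterogeneous bandwidth; a star
    shape is used here only to keep the sketch short.
    """
    return {root: [w for w in workers if w != root]}


def synchronize(grads):
    """grads: {worker_id: [float, ...]} local gradients of equal length.

    Returns {worker_id: [float, ...]} where every worker holds the
    globally aggregated (summed) gradient, computed partition by
    partition over per-root trees.
    """
    workers = sorted(grads)
    n = len(grads[workers[0]])
    k = len(workers)
    # Round-robin partitioning: partition `root` owns indices root, root+k, ...
    parts = {root: list(range(root, n, k)) for root in workers}
    result = {w: [0.0] * n for w in workers}
    for root, idxs in parts.items():
        tree = build_tree(root, workers)
        # Aggregation stage: the root sums its own values with those
        # sent up by its children in the tree.
        agg = {i: grads[root][i] + sum(grads[c][i] for c in tree[root])
               for i in idxs}
        # Broadcast stage: the root pushes the aggregated partition
        # back down the tree to every worker.
        for w in workers:
            for i in idxs:
                result[w][i] = agg[i]
    return result
```

With star trees every partition still traverses the root's links, so this sketch does not capture the bandwidth-aware tree shapes or the overlap of aggregation and broadcast that the paper optimizes; it only shows the per-root, per-partition structure.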
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhang, Z | - |
dc.contributor.author | Wu, C | - |
dc.contributor.author | Li, Z | - |
dc.date.accessioned | 2021-07-27T08:10:43Z | - |
dc.date.available | 2021-07-27T08:10:43Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | IEEE International Conference on Computer Communications (INFOCOM), Virtual Conference, Vancouver, BC, Canada, 10-13 May 2021, p. 1-10 | - |
dc.identifier.issn | 0743-166X | - |
dc.identifier.uri | http://hdl.handle.net/10722/301414 | - |
dc.description.abstract | Distributed machine learning with multiple concurrent workers has been widely adopted to train large deep neural networks (DNNs). Parameter synchronization is a key component in each iteration of distributed training, where workers exchange locally computed gradients through an AllReduce operation or parameter servers for global parameter updates. Parameter synchronization often constitutes a significant portion of the training time; minimizing the communication time contributes substantially to DNN training speed-up. Standard ring-based AllReduce and the parameter server (PS) architecture work efficiently mostly under homogeneous inter-worker connectivity. However, available bandwidth among workers in real-world clusters is often heterogeneous, due to different hardware configurations, switching topologies, and contention with concurrent jobs. This work investigates the best parameter synchronization topology and schedule among workers for the most expedited communication in distributed DNN training. We show that the optimal parameter synchronization topology should consist of trees rooted at different workers, each aggregating or broadcasting a partition of the gradients/parameters. We identify a near-optimal forest packing that maximally utilizes available bandwidth and overlaps the aggregation and broadcast stages to minimize communication time. We provide a theoretical analysis of the performance bound and show, through extensive evaluation under various settings, that our scheme outperforms state-of-the-art parameter synchronization schemes by up to 18.3 times. | - |
dc.language | eng | - |
dc.publisher | IEEE Computer Society. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359 | - |
dc.relation.ispartof | IEEE INFOCOM - IEEE Conference on Computer Communications | - |
dc.rights | IEEE INFOCOM - IEEE Conference on Computer Communications. Copyright © IEEE Computer Society. | - |
dc.rights | ©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | - |
dc.title | Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1109/INFOCOM42981.2021.9488678 | - |
dc.identifier.scopus | eid_2-s2.0-85111944048 | - |
dc.identifier.hkuros | 323509 | - |
dc.identifier.spage | 1 | - |
dc.identifier.epage | 10 | - |
dc.identifier.isi | WOS:000702210400015 | - |
dc.publisher.place | United States | - |