Article: Accelerating Geo-Distributed Machine Learning With Network-Aware Adaptive Tree and Auxiliary Route

Title: Accelerating Geo-Distributed Machine Learning With Network-Aware Adaptive Tree and Auxiliary Route
Authors: Li, Zonghang; Feng, Wenjiao; Cai, Weibo; Yu, Hongfang; Luo, Long; Sun, Gang; Du, Hongyang; Niyato, Dusit
Keywords: communication scheduling; Geo-distributed ML; multipath transmission; synchronization topology
Issue Date: 2024
Citation: IEEE/ACM Transactions on Networking, 2024, v. 32, n. 5, p. 4238-4253
Abstract: Distributed machine learning is becoming increasingly popular for geo-distributed data analytics, facilitating the collaborative analysis of data scattered across data centers in different regions. This paradigm eliminates the need for centralizing sensitive raw data in one location but faces the significant challenge of high parameter synchronization delays, which stem from the constraints of bandwidth-limited, heterogeneous, and fluctuating wide-area networks. Prior research has focused on optimizing the synchronization topology, evolving from starlike to tree-based structures. However, these solutions typically depend on regular tree structures and lack an adequate topology metric, resulting in limited improvements. This paper proposes NetStorm, an adaptive and highly efficient communication scheduler designed to speed up parameter synchronization across geo-distributed data centers. First, it establishes an effective metric for optimizing a multi-root FAPT synchronization topology. Second, a network awareness module is developed to acquire network knowledge, aiding in topology decisions. Third, a multipath auxiliary transmission mechanism is introduced to enhance network awareness and facilitate multipath transmissions. Lastly, we design policy consistency protocols to guarantee seamless updates of transmission policies. Empirical results demonstrate that NetStorm significantly outperforms distributed training systems like MXNET, MLNET, and TSEngine, with a speedup of 6.5-9.2 times over MXNET.
Persistent Identifier: http://hdl.handle.net/10722/353189
ISSN: 1063-6692
2023 Impact Factor: 3.0
2023 SCImago Journal Rankings: 2.034

 

Dublin Core metadata (DC Field: Value)
dc.contributor.author: Li, Zonghang
dc.contributor.author: Feng, Wenjiao
dc.contributor.author: Cai, Weibo
dc.contributor.author: Yu, Hongfang
dc.contributor.author: Luo, Long
dc.contributor.author: Sun, Gang
dc.contributor.author: Du, Hongyang
dc.contributor.author: Niyato, Dusit
dc.date.accessioned: 2025-01-13T03:02:32Z
dc.date.available: 2025-01-13T03:02:32Z
dc.date.issued: 2024
dc.identifier.citation: IEEE/ACM Transactions on Networking, 2024, v. 32, n. 5, p. 4238-4253
dc.identifier.issn: 1063-6692
dc.identifier.uri: http://hdl.handle.net/10722/353189
dc.description.abstract: Distributed machine learning is becoming increasingly popular for geo-distributed data analytics, facilitating the collaborative analysis of data scattered across data centers in different regions. This paradigm eliminates the need for centralizing sensitive raw data in one location but faces the significant challenge of high parameter synchronization delays, which stem from the constraints of bandwidth-limited, heterogeneous, and fluctuating wide-area networks. Prior research has focused on optimizing the synchronization topology, evolving from starlike to tree-based structures. However, these solutions typically depend on regular tree structures and lack an adequate topology metric, resulting in limited improvements. This paper proposes NetStorm, an adaptive and highly efficient communication scheduler designed to speed up parameter synchronization across geo-distributed data centers. First, it establishes an effective metric for optimizing a multi-root FAPT synchronization topology. Second, a network awareness module is developed to acquire network knowledge, aiding in topology decisions. Third, a multipath auxiliary transmission mechanism is introduced to enhance network awareness and facilitate multipath transmissions. Lastly, we design policy consistency protocols to guarantee seamless updates of transmission policies. Empirical results demonstrate that NetStorm significantly outperforms distributed training systems like MXNET, MLNET, and TSEngine, with a speedup of 6.5-9.2 times over MXNET.
dc.language: eng
dc.relation.ispartof: IEEE/ACM Transactions on Networking
dc.subject: communication scheduling
dc.subject: Geo-distributed ML
dc.subject: multipath transmission
dc.subject: synchronization topology
dc.title: Accelerating Geo-Distributed Machine Learning With Network-Aware Adaptive Tree and Auxiliary Route
dc.type: Article
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1109/TNET.2024.3412429
dc.identifier.scopus: eid_2-s2.0-85196085821
dc.identifier.volume: 32
dc.identifier.issue: 5
dc.identifier.spage: 4238
dc.identifier.epage: 4253
dc.identifier.eissn: 1558-2566
