Article: Accelerating Geo-Distributed Machine Learning With Network-Aware Adaptive Tree and Auxiliary Route

Title: Accelerating Geo-Distributed Machine Learning With Network-Aware Adaptive Tree and Auxiliary Route
Authors: Li, Zonghang; Feng, Wenjiao; Cai, Weibo; Yu, Hongfang; Luo, Long; Sun, Gang; Du, Hongyang; Niyato, Dusit
Keywords: communication scheduling; Geo-distributed ML; multipath transmission; synchronization topology
Issue Date: 2024
Citation: IEEE/ACM Transactions on Networking, 2024, v. 32, n. 5, p. 4238-4253
Abstract: Distributed machine learning is becoming increasingly popular for geo-distributed data analytics, facilitating the collaborative analysis of data scattered across data centers in different regions. This paradigm eliminates the need for centralizing sensitive raw data in one location but faces the significant challenge of high parameter synchronization delays, which stem from the constraints of bandwidth-limited, heterogeneous, and fluctuating wide-area networks. Prior research has focused on optimizing the synchronization topology, evolving from starlike to tree-based structures. However, these solutions typically depend on regular tree structures and lack an adequate topology metric, resulting in limited improvements. This paper proposes NetStorm, an adaptive and highly efficient communication scheduler designed to speed up parameter synchronization across geo-distributed data centers. First, it establishes an effective metric for optimizing a multi-root FAPT synchronization topology. Second, a network awareness module is developed to acquire network knowledge, aiding in topology decisions. Third, a multipath auxiliary transmission mechanism is introduced to enhance network awareness and facilitate multipath transmissions. Lastly, we design policy consistency protocols to guarantee seamless updates of transmission policies. Empirical results demonstrate that NetStorm significantly outperforms distributed training systems like MXNET, MLNET, and TSEngine, with a speedup of 6.5-9.2 times over MXNET.
Persistent Identifier: http://hdl.handle.net/10722/353189
ISSN: 1063-6692
2023 Impact Factor: 3.0
2023 SCImago Journal Rankings: 2.034

 

Dublin Core metadata (DC Field: Value)
dc.contributor.author: Li, Zonghang
dc.contributor.author: Feng, Wenjiao
dc.contributor.author: Cai, Weibo
dc.contributor.author: Yu, Hongfang
dc.contributor.author: Luo, Long
dc.contributor.author: Sun, Gang
dc.contributor.author: Du, Hongyang
dc.contributor.author: Niyato, Dusit
dc.date.accessioned: 2025-01-13T03:02:32Z
dc.date.available: 2025-01-13T03:02:32Z
dc.date.issued: 2024
dc.identifier.citation: IEEE/ACM Transactions on Networking, 2024, v. 32, n. 5, p. 4238-4253
dc.identifier.issn: 1063-6692
dc.identifier.uri: http://hdl.handle.net/10722/353189
dc.description.abstract: Distributed machine learning is becoming increasingly popular for geo-distributed data analytics, facilitating the collaborative analysis of data scattered across data centers in different regions. This paradigm eliminates the need for centralizing sensitive raw data in one location but faces the significant challenge of high parameter synchronization delays, which stem from the constraints of bandwidth-limited, heterogeneous, and fluctuating wide-area networks. Prior research has focused on optimizing the synchronization topology, evolving from starlike to tree-based structures. However, these solutions typically depend on regular tree structures and lack an adequate topology metric, resulting in limited improvements. This paper proposes NetStorm, an adaptive and highly efficient communication scheduler designed to speed up parameter synchronization across geo-distributed data centers. First, it establishes an effective metric for optimizing a multi-root FAPT synchronization topology. Second, a network awareness module is developed to acquire network knowledge, aiding in topology decisions. Third, a multipath auxiliary transmission mechanism is introduced to enhance network awareness and facilitate multipath transmissions. Lastly, we design policy consistency protocols to guarantee seamless updates of transmission policies. Empirical results demonstrate that NetStorm significantly outperforms distributed training systems like MXNET, MLNET, and TSEngine, with a speedup of 6.5-9.2 times over MXNET.
dc.language: eng
dc.relation.ispartof: IEEE/ACM Transactions on Networking
dc.subject: communication scheduling
dc.subject: Geo-distributed ML
dc.subject: multipath transmission
dc.subject: synchronization topology
dc.title: Accelerating Geo-Distributed Machine Learning With Network-Aware Adaptive Tree and Auxiliary Route
dc.type: Article
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1109/TNET.2024.3412429
dc.identifier.scopus: eid_2-s2.0-85196085821
dc.identifier.volume: 32
dc.identifier.issue: 5
dc.identifier.spage: 4238
dc.identifier.epage: 4253
dc.identifier.eissn: 1558-2566
