Article: Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration in Heterogeneous Systems

Title: Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration in Heterogeneous Systems
Authors: Luo, Ziyue; Bao, Yixin; Wu, Chuan
Keywords: Computational modeling; Data transfer; Distributed databases; Distributed machine learning systems; graph neural network; Graph neural networks; online scheduling; Optimal scheduling; Task analysis; Training
Issue Date: 19-Jun-2024
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE/ACM Transactions on Networking, 2024
Abstract: Training Graph Neural Networks (GNNs) on large graphs is resource-intensive and time-consuming, mainly because the large graph data cannot fit into the memory of a single machine and must be fetched from distributed graph storage and processed on the go. Unlike distributed deep neural network (DNN) training, the bottleneck in distributed GNN training lies largely in the transmission of large volumes of graph data for constructing mini-batches of training samples. Existing solutions often advocate data-computation colocation and do not work well with limited resources and heterogeneous training devices in heterogeneous clusters. The potential of strategic task placement and optimal scheduling of data transmission and task execution has not been well explored. This paper designs an efficient algorithm framework for task placement and execution scheduling of distributed GNN training in heterogeneous systems, to improve resource utilization and execution pipelining and to expedite training completion. Our framework consists of two modules: (i) an online scheduling algorithm that schedules the execution of training tasks and the data transmission plan; and (ii) an exploratory task placement scheme that decides the placement of each training task. We conduct thorough theoretical analysis, testbed experiments, and simulation studies, and observe up to 48% training speed-up with our algorithm compared to representative baselines in our testbed settings.
Persistent Identifier: http://hdl.handle.net/10722/345696
ISSN: 1063-6692
2023 Impact Factor: 3.0
2023 SCImago Journal Rankings: 2.034

 

DC Field: Value
dc.contributor.author: Luo, Ziyue
dc.contributor.author: Bao, Yixin
dc.contributor.author: Wu, Chuan
dc.date.accessioned: 2024-08-27T09:10:34Z
dc.date.available: 2024-08-27T09:10:34Z
dc.date.issued: 2024-06-19
dc.identifier.citation: IEEE/ACM Transactions on Networking, 2024
dc.identifier.issn: 1063-6692
dc.identifier.uri: http://hdl.handle.net/10722/345696
dc.description.abstract: Training Graph Neural Networks (GNNs) on large graphs is resource-intensive and time-consuming, mainly because the large graph data cannot fit into the memory of a single machine and must be fetched from distributed graph storage and processed on the go. Unlike distributed deep neural network (DNN) training, the bottleneck in distributed GNN training lies largely in the transmission of large volumes of graph data for constructing mini-batches of training samples. Existing solutions often advocate data-computation colocation and do not work well with limited resources and heterogeneous training devices in heterogeneous clusters. The potential of strategic task placement and optimal scheduling of data transmission and task execution has not been well explored. This paper designs an efficient algorithm framework for task placement and execution scheduling of distributed GNN training in heterogeneous systems, to improve resource utilization and execution pipelining and to expedite training completion. Our framework consists of two modules: (i) an online scheduling algorithm that schedules the execution of training tasks and the data transmission plan; and (ii) an exploratory task placement scheme that decides the placement of each training task. We conduct thorough theoretical analysis, testbed experiments, and simulation studies, and observe up to 48% training speed-up with our algorithm compared to representative baselines in our testbed settings.
dc.language: eng
dc.publisher: Institute of Electrical and Electronics Engineers
dc.relation.ispartof: IEEE/ACM Transactions on Networking
dc.subject: Computational modeling
dc.subject: Data transfer
dc.subject: Distributed databases
dc.subject: Distributed machine learning systems
dc.subject: graph neural network
dc.subject: Graph neural networks
dc.subject: online scheduling
dc.subject: Optimal scheduling
dc.subject: Task analysis
dc.subject: Training
dc.title: Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration in Heterogeneous Systems
dc.type: Article
dc.identifier.doi: 10.1109/TNET.2024.3415089
dc.identifier.scopus: eid_2-s2.0-85196733205
dc.identifier.eissn: 1558-2566
dc.identifier.issnl: 1063-6692
