Article: Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration in Heterogeneous Systems

Title: Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration in Heterogeneous Systems
Authors: Luo, Ziyue; Bao, Yixin; Wu, Chuan
Keywords: Computational modeling; Data transfer; Distributed databases; Distributed machine learning systems; graph neural network; Graph neural networks; online scheduling; Optimal scheduling; Task analysis; Training
Issue Date: 19-Jun-2024
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE/ACM Transactions on Networking, 2024
Abstract: Training Graph Neural Networks (GNNs) on large graphs is resource-intensive and time-consuming, mainly because the large graph data cannot fit into the memory of a single machine and must be fetched from distributed graph storage and processed on the go. Unlike distributed deep neural network (DNN) training, the bottleneck in distributed GNN training lies largely in the transmission of large volumes of graph data for constructing mini-batches of training samples. Existing solutions often advocate data-computation colocation and do not work well with limited resources and heterogeneous training devices in heterogeneous clusters. The potential of strategic task placement and optimal scheduling of data transmission and task execution has not been well explored. This paper designs an efficient algorithm framework for task placement and execution scheduling of distributed GNN training in heterogeneous systems, to improve resource utilization and execution pipelining and to expedite training completion. Our framework consists of two modules: (i) an online scheduling algorithm that schedules the execution of training tasks and the data transmission plan; and (ii) an exploratory task placement scheme that decides the placement of each training task. We conduct thorough theoretical analysis, testbed experiments, and simulation studies, and observe up to 48% training speed-up with our algorithm compared to representative baselines in our testbed settings.
Persistent Identifier: http://hdl.handle.net/10722/345696
ISSN: 1063-6692
2023 Impact Factor: 3.0
2023 SCImago Journal Rankings: 2.034

 

DC Field: Value
dc.contributor.author: Luo, Ziyue
dc.contributor.author: Bao, Yixin
dc.contributor.author: Wu, Chuan
dc.date.accessioned: 2024-08-27T09:10:34Z
dc.date.available: 2024-08-27T09:10:34Z
dc.date.issued: 2024-06-19
dc.identifier.citation: IEEE/ACM Transactions on Networking, 2024
dc.identifier.issn: 1063-6692
dc.identifier.uri: http://hdl.handle.net/10722/345696
dc.description.abstract: Training Graph Neural Networks (GNNs) on large graphs is resource-intensive and time-consuming, mainly because the large graph data cannot fit into the memory of a single machine and must be fetched from distributed graph storage and processed on the go. Unlike distributed deep neural network (DNN) training, the bottleneck in distributed GNN training lies largely in the transmission of large volumes of graph data for constructing mini-batches of training samples. Existing solutions often advocate data-computation colocation and do not work well with limited resources and heterogeneous training devices in heterogeneous clusters. The potential of strategic task placement and optimal scheduling of data transmission and task execution has not been well explored. This paper designs an efficient algorithm framework for task placement and execution scheduling of distributed GNN training in heterogeneous systems, to improve resource utilization and execution pipelining and to expedite training completion. Our framework consists of two modules: (i) an online scheduling algorithm that schedules the execution of training tasks and the data transmission plan; and (ii) an exploratory task placement scheme that decides the placement of each training task. We conduct thorough theoretical analysis, testbed experiments, and simulation studies, and observe up to 48% training speed-up with our algorithm compared to representative baselines in our testbed settings.
dc.language: eng
dc.publisher: Institute of Electrical and Electronics Engineers
dc.relation.ispartof: IEEE/ACM Transactions on Networking
dc.subject: Computational modeling
dc.subject: Data transfer
dc.subject: Distributed databases
dc.subject: Distributed machine learning systems
dc.subject: graph neural network
dc.subject: Graph neural networks
dc.subject: online scheduling
dc.subject: Optimal scheduling
dc.subject: Task analysis
dc.subject: Training
dc.title: Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration in Heterogeneous Systems
dc.type: Article
dc.identifier.doi: 10.1109/TNET.2024.3415089
dc.identifier.scopus: eid_2-s2.0-85196733205
dc.identifier.eissn: 1558-2566
dc.identifier.issnl: 1063-6692
