Accelerating Distributed DNN Training through Fine-grained Communication and Computation Placement and Scheduling


Grant Data
Project Title
Accelerating Distributed DNN Training through Fine-grained Communication and Computation Placement and Scheduling
Principal Investigator
Professor Wu, Chuan   (Principal Investigator (PI))
Duration
36 months
Start Date
2020-09-01
Completion Date
2023-08-31
Amount
845,055
Keywords
Communication scheduling, Device placement, Distributed ML systems, DNN Training
Discipline
Network; Others - Computing Science and Information Technology
Panel
Engineering (E)
HKU Project Code
17208920
Grant Type
General Research Fund (GRF)
Funding Year
2020
Status
Completed
Objectives
1) [Algorithms for Optimal Communication Scheduling in General DNN Training DAG]: Design efficient and near-optimal communication tensor partition/merge and transmission scheduling algorithms for any given DNN training DAG.
2) [Algorithms for Computation Operator Placement and Execution Scheduling]: Design efficient and near-optimal device placement and execution order scheduling algorithms for computation operators in a DNN training DAG.
3) [Algorithms for Joint Computation and Communication Scheduling]: Design efficient strategies for joint computation and communication operator partition/merge, placement and execution order scheduling, as well as parameter synchronisation architecture selection.
4) [Implementation and Evaluation]: Implement our algorithms and strategies on ML cluster schedulers, and deploy them in real-world AI clouds for evaluation.
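To illustrate the flavour of Objective 1, the following is a minimal sketch of communication tensor partitioning and priority-ordered transmission scheduling. It is an illustrative assumption, not the project's actual algorithm: gradient tensors are split into fixed-size chunks, and chunks of higher-priority tensors (e.g. those consumed earliest in the next iteration's forward pass) are scheduled for transmission first. The function names, tuple layout, and priority convention are all hypothetical.

```python
def partition_tensor(name, size, chunk_size):
    """Split a tensor of `size` elements into chunks of at most `chunk_size`.

    Returns a list of (name, offset, length) tuples describing each chunk.
    """
    chunks = []
    offset = 0
    while offset < size:
        length = min(chunk_size, size - offset)
        chunks.append((name, offset, length))
        offset += length
    return chunks


def schedule_transmissions(tensors, chunk_size):
    """Order chunk transmissions by tensor priority.

    `tensors` is a list of (name, size, priority); a lower priority value
    means the tensor should be transmitted earlier.  Returns chunks in
    transmission order.
    """
    ordered = []
    for name, size, _priority in sorted(tensors, key=lambda t: t[2]):
        ordered.extend(partition_tensor(name, size, chunk_size))
    return ordered


# Example: layer1's gradient (priority 0) is sent before layer2's,
# even though layer2's gradient becomes available first in backprop.
plan = schedule_transmissions(
    [("layer2.grad", 5, 1), ("layer1.grad", 3, 0)], chunk_size=2
)
```

Partitioning large tensors into chunks lets the scheduler preempt a low-priority transfer mid-tensor, which is what enables fine-grained overlap of communication with computation.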