Deep Learning-Driven Schedulers for Distributed Machine Learning Clusters


Grant Data
Project Title
Deep Learning-Driven Schedulers for Distributed Machine Learning Clusters
Principal Investigator
Professor Wu, Chuan (Principal Investigator (PI))
Duration
36 months
Start Date
2020-01-01
Amount
731089
Conference Title
Deep Learning-Driven Schedulers for Distributed Machine Learning Clusters
Keywords
Cloud Computing Clusters, Distributed ML Systems, Reinforcement Learning, Resource Scheduler
Discipline
Software; Others - Computing Science and Information Technology
Panel
Engineering (E)
HKU Project Code
17204619
Grant Type
General Research Fund (GRF)
Funding Year
2019
Status
Completed
Objectives
1) [DL Algorithm for Resource Allocation among Distributed ML Jobs]: Design an efficient DL model and algorithm for adjusting the numbers of workers and parameter servers (PSs) in concurrent training jobs in an ML cluster.
2) [DL Algorithm for Task Placement in the ML Cluster]: Design an efficient DL model and algorithm for placing the workers and PSs of concurrent training jobs.
3) [Joint DL Approach for Resource Allocation and Placement in the ML Cluster]: Design a hierarchical DL model and algorithm for dynamic allocation and placement of workers and PSs in training jobs.
4) [Implementation and Evaluation]: Implement an ML cluster scheduler that runs our algorithms, and deploy it in real-world AI clouds for evaluation and comparison.