Appears in Collections: postgraduate thesis: Scheduling and communication optimization for distributed machine learning in data centers
Title | Scheduling and communication optimization for distributed machine learning in data centers |
---|---|
Authors | Zhao, Xiaoyang (赵晓扬) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Zhao, X. [赵晓扬]. (2024). Scheduling and communication optimization for distributed machine learning in data centers. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Deep learning (DL) models continue to grow in size to serve various machine learning (ML) applications. There is a pressing need for distributed model training using large numbers of devices (e.g., GPUs) in a data center, with parallelism techniques such as data, model, and tensor parallelism. However, in a production data center where devices are shared by multiple tenants, guaranteeing the performance of training jobs becomes a complicated challenge. Concurrently running jobs compete for network bandwidth and computation resources, leading to significant network congestion, performance interference, and thus low cluster throughput. This is especially true for distributed DL, where the communication phase among devices/servers (for gradient synchronization, intermediate data exchange, etc.) is a major bottleneck and convergence efficiency varies with the scale of allocated resources. Therefore, we introduce four system designs, AdapCC, MLCC, FaPES, and a Multi-agent Reinforcement Learning (MARL) based scheduling framework, to accelerate communication and schedule resources for distributed DL in data centers.
AdapCC is an adaptive communication library that dynamically adapts to resource heterogeneity and network variability for optimized collective communication and training performance in data center environments. AdapCC generates communication strategies based on run-time profiling, mitigates resource waste from waiting for computation stragglers, and executes efficient data transfers among DL workers. Experimental results under various settings demonstrate a 2x communication speedup and a 31% training throughput improvement with AdapCC, compared to NCCL and other representative communication backends.
To address network congestion for maximized network utilization and job training speed, MLCC provides a per-host flow scheduler that dynamically tunes the sending rates of outgoing tensor flows from each server. MLCC comprises two components: a monitoring module that interacts with communication libraries (e.g., NCCL, BytePS) to track the training progress of DL jobs, and a congestion control protocol that receives in-network feedback from switches and computes flow sending rates. Experiments on a 48-GPU testbed demonstrate that MLCC achieves at least a 15% training speedup compared to existing congestion control protocols.
FaPES is a FaaS-oriented Performance-aware Elastic Scaling system that enables efficient resource allocation for ML jobs on serverless platforms. FaPES supports flexible resource loaning between the virtual clusters running training and inference jobs, and employs a comprehensive auto-scaling mechanism to guarantee the service-level objective (SLO) of inference jobs while improving the efficiency of training jobs. Evaluation on a 128-GPU testbed demonstrates up to a 24.8% reduction in job completion time and a 1.8x goodput improvement compared to representative elastic scaling schemes.
Additionally, we adopt multiple schedulers in a large-scale data center and propose a MARL scheduling framework that cooperatively learns fine-grained job placement policies with the objective of minimizing job completion time (JCT). The proposed framework uses hierarchical graph neural networks (GNNs) to encode the data center topology and server architecture. A job interference model is further devised to predict interference levels under various co-location scenarios. Testbed evaluations verify a more than 20% reduction in JCT. |
Degree | Doctor of Philosophy |
Subject | Machine learning; Distributed artificial intelligence; Data centers |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/352660 |
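The congestion control loop summarized in the abstract (switches report in-network feedback; each host tunes the sending rates of its outgoing tensor flows) can be sketched as a simple additive-increase/multiplicative-decrease controller. This is an illustrative assumption of how such a loop might look, not the thesis's MLCC implementation; the class name, parameters, and back-off rule below are all hypothetical.

```python
class FlowRateController:
    """Additive-increase / multiplicative-decrease rate control for one
    outgoing tensor flow, reacting to switch congestion marks."""

    def __init__(self, rate_gbps=10.0, min_rate=0.1, max_rate=100.0):
        self.rate = rate_gbps
        self.min_rate = min_rate
        self.max_rate = max_rate

    def on_feedback(self, congestion_marked: bool, queue_depth_frac: float) -> float:
        """Update the sending rate from one round of in-network feedback.

        congestion_marked: ECN-style mark from a switch on the flow's path.
        queue_depth_frac: reported queue occupancy in [0, 1]; deeper queues
        trigger a sharper multiplicative decrease.
        """
        if congestion_marked:
            # Back off proportionally to how congested the switch queue is.
            self.rate *= max(0.5, 1.0 - 0.5 * queue_depth_frac)
        else:
            # Probe for spare bandwidth with a fixed additive step.
            self.rate += 0.5
        # Clamp to the configured operating range.
        self.rate = min(max(self.rate, self.min_rate), self.max_rate)
        return self.rate


ctrl = FlowRateController(rate_gbps=10.0)
ctrl.on_feedback(congestion_marked=True, queue_depth_frac=0.8)   # back off
ctrl.on_feedback(congestion_marked=False, queue_depth_frac=0.0)  # probe up
```

In a real system the feedback round-trip and the per-flow rate enforcement would sit in the transport layer; the sketch only shows the rate-update rule itself.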
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhao, Xiaoyang | - |
dc.contributor.author | 赵晓扬 | - |
dc.date.accessioned | 2024-12-19T09:27:03Z | - |
dc.date.available | 2024-12-19T09:27:03Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Zhao, X. [赵晓扬]. (2024). Scheduling and communication optimization for distributed machine learning in data centers. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/352660 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Machine learning | - |
dc.subject.lcsh | Distributed artificial intelligence | - |
dc.subject.lcsh | Data centers | - |
dc.title | Scheduling and communication optimization for distributed machine learning in data centers | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044891406303414 | - |
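The interference-aware placement described in the abstract (a job interference model predicting slowdown levels under various co-locations, used to guide fine-grained placement) can be illustrated with a toy scoring rule. The interference table, job-type labels, and scoring function below are hypothetical placeholders, not the thesis's MARL policy or its learned interference model.

```python
# Predicted slowdown factor when two job types share a server (illustrative).
INTERFERENCE = {
    ("allreduce", "allreduce"): 0.30,
    ("allreduce", "ps"): 0.15,
    ("ps", "ps"): 0.10,
}


def slowdown(a: str, b: str) -> float:
    """Symmetric lookup of the pairwise interference estimate."""
    return INTERFERENCE.get((a, b), INTERFERENCE.get((b, a), 0.0))


def placement_score(new_job: str, server_jobs: list, free_gpus: int, need_gpus: int) -> float:
    """Score a candidate server: require enough free GPUs, then prefer
    low predicted interference with the jobs already co-located there."""
    if free_gpus < need_gpus:
        return float("-inf")  # infeasible placement
    total_slowdown = sum(slowdown(new_job, j) for j in server_jobs)
    return -total_slowdown  # less interference -> higher score


servers = {
    "s1": {"jobs": ["allreduce"], "free": 4},
    "s2": {"jobs": ["ps"], "free": 4},
}
best = max(
    servers,
    key=lambda s: placement_score("allreduce", servers[s]["jobs"], servers[s]["free"], 2),
)
# Co-locating with a parameter-server job interferes less than with
# another all-reduce job, so "s2" wins under this toy table.
```

A learned scheduler would replace the static table with a predictor and would score placements jointly across agents, but the feasibility-then-interference structure of the decision is the same.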