
Postgraduate thesis: Scheduling and communication optimization for distributed machine learning in data centers

Title: Scheduling and communication optimization for distributed machine learning in data centers
Authors: Zhao, Xiaoyang (赵晓扬)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Zhao, X. [赵晓扬]. (2024). Scheduling and communication optimization for distributed machine learning in data centers. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Deep learning (DL) models continue to grow in size to serve various machine learning (ML) applications. There is a pressing need to train such models in a distributed manner across a large number of devices (e.g., GPUs) in a data center, using parallelism techniques such as data, model, and tensor parallelism. However, in production data centers where devices are shared by multiple tenants, guaranteeing the performance of training jobs is challenging. Concurrently running jobs compete for network bandwidth and computation resources, leading to significant network congestion, performance interference, and thus low cluster throughput. This is especially true for distributed DL, where the communication phase among devices/servers (for gradient synchronization, intermediate data exchange, etc.) is a major bottleneck and convergence efficiency varies with the scale of allocated resources. We therefore introduce four system designs, AdapCC, MLCC, FaPES, and a multi-agent reinforcement learning (MARL) based scheduling framework, to accelerate communication and schedule resources for distributed DL in data centers.

AdapCC is an adaptive communication library that dynamically adapts to resource heterogeneity and network variability to optimize collective communication and training performance in a data center environment. AdapCC generates communication strategies based on run-time profiling, mitigates the resource waste of waiting for computation stragglers, and executes efficient data transfers among DL workers. Experiments under various settings demonstrate a 2x communication speed-up and a 31% training throughput improvement with AdapCC, compared to NCCL and other representative communication backends.

To address network congestion and maximize network utilization and training speed, MLCC provides a per-host flow scheduler that dynamically tunes the sending rates of outgoing tensor flows from each server. MLCC comprises two components: a monitoring module that interacts with communication libraries (e.g., NCCL, BytePS) to track the training progress of DL jobs, and a congestion control protocol that receives in-network feedback from switches and computes flow sending rates. Experiments on a 48-GPU testbed demonstrate that MLCC brings at least a 15% training speedup compared to existing congestion control protocols.

FaPES is a FaaS-oriented, performance-aware elastic scaling system that enables efficient resource allocation for ML jobs on serverless platforms. FaPES allows flexible resource loaning between the virtual clusters that run training and inference jobs, and employs a comprehensive auto-scaling mechanism to guarantee the service level objective (SLO) of inference jobs while improving the efficiency of training jobs. Evaluation on a 128-GPU testbed demonstrates up to a 24.8% reduction in job completion time (JCT) and a 1.8x goodput improvement, compared to representative elastic scaling schemes.

Finally, we adopt multiple schedulers in a large-scale data center and propose a MARL scheduling framework that cooperatively learns fine-grained job placement policies with the objective of minimizing JCT. The framework uses hierarchical graph neural networks (GNNs) to encode the data center topology and server architecture, and a job interference model is further devised to predict interference levels under various co-locations. Testbed evaluations verify more than a 20% reduction in JCT.
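To make the MLCC description above more concrete, the following is a minimal sketch of a per-host, progress-aware flow scheduler driven by in-network feedback. The class names, the progress-reporting interface, and the proportional rate-sharing rule are illustrative assumptions, not the algorithm implemented in the thesis.

```python
# Illustrative sketch only: a per-host scheduler that adjusts the sending rates of
# outgoing tensor flows using switch congestion feedback, in the spirit of the MLCC
# description above. All names and the rate-sharing rule are assumptions.

from dataclasses import dataclass, field


@dataclass
class FlowState:
    job_id: str
    bytes_remaining: float   # gradient bytes still to send in the current iteration
    rate_mbps: float = 0.0   # sending rate currently assigned to this flow


@dataclass
class HostFlowScheduler:
    link_capacity_mbps: float
    flows: dict = field(default_factory=dict)  # flow_id -> FlowState

    def on_progress_report(self, flow_id: str, state: FlowState) -> None:
        # Fed by a monitoring module that tracks each DL job's training progress.
        self.flows[flow_id] = state

    def on_switch_feedback(self, congestion_level: float) -> None:
        # congestion_level in [0, 1], e.g. derived from ECN marks or queue depth.
        available = self.link_capacity_mbps * max(0.0, 1.0 - congestion_level)
        total = sum(f.bytes_remaining for f in self.flows.values())
        if total == 0:
            return
        # Flows with more outstanding gradient traffic get a larger bandwidth share.
        for f in self.flows.values():
            f.rate_mbps = available * f.bytes_remaining / total


# Example wiring with hypothetical values:
sched = HostFlowScheduler(link_capacity_mbps=100_000)  # 100 Gbps host link
sched.on_progress_report("job1/allreduce", FlowState("job1", bytes_remaining=2e9))
sched.on_progress_report("job2/allreduce", FlowState("job2", bytes_remaining=5e8))
sched.on_switch_feedback(congestion_level=0.3)
```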
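The resource-loaning idea behind FaPES can likewise be pictured as a small control-loop decision between the inference and training virtual clusters. The function, its parameters, and the 90%-of-SLO threshold below are hypothetical and serve only to illustrate the SLO-versus-training-efficiency trade-off described in the abstract.

```python
# Hypothetical sketch of a FaPES-style loaning decision between virtual clusters.
# The threshold and the function/parameter names are assumptions for exposition only.

def loan_decision(inference_p99_ms: float,
                  slo_ms: float,
                  idle_inference_gpus: int,
                  pending_training_gpus: int,
                  loaned_gpus: int) -> int:
    """Return the number of GPUs to move from the inference virtual cluster to the
    training virtual cluster; a negative value reclaims previously loaned GPUs."""
    if inference_p99_ms > 0.9 * slo_ms and loaned_gpus > 0:
        # Inference latency is approaching its SLO: reclaim loaned capacity first.
        return -loaned_gpus
    if idle_inference_gpus > 0 and pending_training_gpus > 0:
        # Spare inference capacity and queued training demand: loan GPUs out.
        return min(idle_inference_gpus, pending_training_gpus)
    return 0
```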
Degree: Doctor of Philosophy
Subject: Machine learning; Distributed artificial intelligence; Data centers
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/352660

 

Dublin Core record (DC Field: Value)
dc.contributor.author: Zhao, Xiaoyang
dc.contributor.author: 赵晓扬
dc.date.accessioned: 2024-12-19T09:27:03Z
dc.date.available: 2024-12-19T09:27:03Z
dc.date.issued: 2024
dc.identifier.citation: Zhao, X. [赵晓扬]. (2024). Scheduling and communication optimization for distributed machine learning in data centers. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/352660
dc.description.abstract: (see Abstract above)
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Distributed artificial intelligence
dc.subject.lcsh: Data centers
dc.title: Scheduling and communication optimization for distributed machine learning in data centers
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044891406303414
