Appears in Collections: postgraduate thesis: Accelerating distributed DNN training in AI clouds
Title | Accelerating distributed DNN training in AI clouds |
---|---|
Authors | Peng, Yanghua (彭杨华) |
Advisors | Wu, C |
Issue Date | 2020 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Peng, Y. [彭杨华]. (2020). Accelerating distributed DNN training in AI clouds. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | The increasing scale of data, model complexity, and computation infrastructure has driven the development of deep learning (DL) significantly in recent years. DL is now used for a wide range of applications, such as computer vision, speech recognition, machine translation, and online recommendation. Training deep neural networks (DNNs), however, is computation-hungry and time-consuming. Distributed training scales out and accelerates DNN training using multiple devices, but its performance is often poor for various reasons, e.g., communication overhead, unreasonable resource configuration, and runtime interference. This thesis addresses the challenges that arise when scaling distributed DNN training in AI clouds, including training parallelization, resource scheduling, and elasticity. Our approach is to exploit the characteristics of DL training to co-optimize DL training frameworks and cluster schedulers. Specifically, we have designed and implemented three systems for scalable and efficient distributed DNN training: ByteScheduler, Optimus, and DL2.
We first address the scalability of training DNNs on multiple devices in a data-parallel way. ByteScheduler is a generic communication scheduler that accelerates distributed DNN training. We design a unified scheduling framework that abstracts communication scheduling away from the underlying DL training frameworks, communication architectures, and network protocols, without modifying their implementations. In addition, we propose a principled scheduling algorithm that is theoretically optimal when there is no system overhead. We identify the key system parameters, i.e., partition size and credit size, and design Bayesian-optimization-based auto-tuning so that the algorithm performs well in different runtime environments. Our implementation supports MXNet, PyTorch, and TensorFlow, with both PS and all-reduce gradient synchronization, over either TCP or RDMA. ByteScheduler accelerates training across all evaluated system configurations and DNN models, by up to 196%.
We then study the resource scheduling problem in which multiple DL training jobs share resources in AI clouds. Optimus is a customized GPU cluster scheduler that minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training and builds performance models that accurately estimate each job's training speed as a function of its allocated resources. Based on these models, a simple yet effective method dynamically allocates resources and places parameter server and worker tasks to minimize average job completion time. Our experiments on a Kubernetes cluster show that Optimus achieves high job performance and resource efficiency, outperforming the DRF scheduler by 139%.
Finally, we investigate deep reinforcement learning for elastic and dynamic resource allocation. DL2 is a deep-learning-driven scheduler for GPU clusters that aims to expedite training jobs globally by maximizing cluster resource utilization. It advocates a joint supervised learning and reinforcement learning approach: the policy neural network is first warmed up via offline supervised learning on job traces produced by the existing cluster scheduler, and then fine-tuned online via reinforcement learning. We implement DL2 on Kubernetes and enable dynamic resource scaling of DL jobs on MXNet. DL2 outperforms the expert scheduler by 17.5% in terms of average job completion time. |
Degree | Doctor of Philosophy |
Subject | Neural networks (Computer science); Machine learning |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/282303 |
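
The three systems summarized in the abstract above lend themselves to short illustrative sketches. The first is a minimal sketch of priority-based communication scheduling in the spirit of ByteScheduler's description: gradient tensors are partitioned into chunks, the most urgent (earlier-layer) chunks are sent first, and the bytes in flight are capped by a credit. All class and parameter names here (`PriorityCommScheduler`, `partition_size`, `credit_size`, `send_chunk`) are assumptions for illustration, not the thesis's actual API.

```python
# Minimal sketch (assumed names, not the thesis's API) of priority-based
# communication scheduling: partition gradient tensors into chunks, send the
# most urgent chunks first, and cap the bytes in flight with a credit.
import heapq
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class Chunk:
    priority: int                          # smaller = earlier layer = needed sooner
    payload: bytes = field(compare=False)

class PriorityCommScheduler:
    def __init__(self, partition_size: int, credit_size: int,
                 send_chunk: Callable[[int, bytes], None]):
        self.partition_size = partition_size   # bytes per tensor partition
        self.credit_size = credit_size         # max bytes allowed in flight
        self.send_chunk = send_chunk           # transport callback (assumed)
        self.queue: list[Chunk] = []           # min-heap ordered by priority
        self.in_flight = 0

    def enqueue_tensor(self, layer_index: int, tensor_bytes: bytes) -> None:
        # Partitioning lets an urgent small chunk overtake a large, less urgent tensor.
        for start in range(0, len(tensor_bytes), self.partition_size):
            heapq.heappush(self.queue,
                           Chunk(layer_index, tensor_bytes[start:start + self.partition_size]))
        self._drain()

    def on_chunk_acked(self, nbytes: int) -> None:
        # The transport acknowledged a chunk; release its credit and keep draining.
        self.in_flight -= nbytes
        self._drain()

    def _drain(self) -> None:
        # Push the highest-priority chunks while the credit budget allows.
        while self.queue and self.in_flight + self.partition_size <= self.credit_size:
            chunk = heapq.heappop(self.queue)
            self.in_flight += len(chunk.payload)
            self.send_chunk(chunk.priority, chunk.payload)
```

The two knobs exposed here, partition size and credit size, are exactly the parameters the abstract reports as auto-tuned at runtime with Bayesian optimization rather than set by hand.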
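The Optimus paragraph describes two online-fitted models: a convergence model that predicts how many more steps a job needs, and a throughput model that predicts training speed from the allocated parameter servers and workers. The sketch below assumes simple functional forms (a reciprocal loss curve and a PS/worker throughput expression) purely for illustration; the thesis's actual models and coefficients may differ.

```python
# Hedged sketch of Optimus-style online fitting. The functional forms are
# assumptions; only the overall idea -- fit models online, then estimate
# remaining time as remaining_steps / speed(resources) -- comes from the abstract.
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(step, a, b, c):
    # Assumed convergence model: training loss decays roughly as 1/(a*step + b) + c.
    return 1.0 / (a * step + b) + c

def speed_model(num_ps, num_workers, t0, t1, t2):
    # Assumed throughput model (steps/sec) for a parameter-server architecture.
    return num_workers / (t0 + t1 * num_workers / num_ps + t2 * num_workers)

def fit_remaining_steps(observed_steps, observed_losses, target_loss):
    # Fit the loss curve to the history seen so far, then solve for the step
    # at which the fitted curve reaches target_loss.
    (a, b, c), _ = curve_fit(loss_curve,
                             np.asarray(observed_steps, dtype=float),
                             np.asarray(observed_losses, dtype=float),
                             p0=(1e-3, 1.0, 0.1), maxfev=10000)
    if target_loss <= c:
        return float("inf")                # fitted curve never reaches the target
    return max(0.0, (1.0 / (target_loss - c) - b) / a)

def estimated_completion_time(num_ps, num_workers, speed_params, remaining_steps):
    # The quantity a scheduler in this style would compare across allocations.
    return remaining_steps / speed_model(num_ps, num_workers, *speed_params)
```

In this sketch, one natural allocation rule is to greedily give the next GPU or PS to whichever job's estimated completion time decreases the most; the abstract only states that a "simple yet effective" dynamic allocation and placement method is built on these models.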
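Finally, DL2's joint supervised-plus-reinforcement-learning approach can be sketched as a two-phase update of one policy network: a warm-up phase that imitates the existing scheduler's decisions offline, followed by an online policy-gradient phase. The network shape, state and action dimensions, and use of PyTorch below are assumptions for illustration (the thesis implements DL2 on Kubernetes with dynamic scaling of MXNet jobs).

```python
# Minimal sketch (not the thesis's code) of DL2's two-phase policy training:
# warm the policy network up by imitating the existing scheduler offline, then
# fine-tune it online with a REINFORCE-style policy gradient.
import torch
import torch.nn as nn

policy = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),     # 64-dim cluster/job state (assumed)
    nn.Linear(128, 16),                # 16 discrete resource-allocation actions (assumed)
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def warmup_step(states, expert_actions):
    # Phase 1 (offline supervised learning): match the existing scheduler's
    # decisions recorded in job traces, via cross-entropy.
    loss = nn.functional.cross_entropy(policy(states), expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def reinforce_step(states, actions, rewards, baseline):
    # Phase 2 (online reinforcement learning): policy gradient with a baseline;
    # the reward would reflect job progress and cluster utilization.
    logp = torch.log_softmax(policy(states), dim=-1)
    chosen = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -((rewards - baseline) * chosen).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```
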
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Wu, C | - |
dc.contributor.author | Peng, Yanghua | - |
dc.contributor.author | 彭杨华 | - |
dc.date.accessioned | 2020-05-07T07:17:18Z | - |
dc.date.available | 2020-05-07T07:17:18Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Peng, Y. [彭杨华]. (2020). Accelerating distributed DNN training in AI clouds. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/282303 | - |
dc.description.abstract | (abstract as shown above) | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Neural networks (Computer science) | - |
dc.subject.lcsh | Machine learning | - |
dc.title | Accelerating distributed DNN training in AI clouds | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2020 | - |
dc.identifier.mmsid | 991044229569003414 | - |