
Postgraduate Thesis: Efficient parameter update strategy for distributed deep learning system

Title: Efficient parameter update strategy for distributed deep learning system
Authors: Zhang, Zhaorui (張兆瑞)
Advisors: Wang, CL
Issue Date: 2021
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Zhang, Z. [張兆瑞]. (2021). Efficient parameter update strategy for distributed deep learning system. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Deep neural networks (DNNs) have achieved remarkable breakthroughs in recent years, with notable success in learning tasks across multiple domains, including autonomous driving, natural language processing, computer vision, and speech recognition. As DNN models scale up in parameter size and training datasets grow rapidly, DNN training has become a time-consuming task that often takes several days or even weeks. High-performance computing clusters consisting of multiple high-density multi-GPU servers connected by high-speed networks (HDGib clusters) have fueled the growth of distributed deep learning, which promises to shorten training time. The parameter server architecture with data parallelism and asynchronous parameter updates has been identified as an efficient approach for scaling DNN training on HDGib clusters. However, the performance improvement gained from these high-performance computing resources is often limited and far from the theoretical value. Communication and synchronization overhead is widely considered the bottleneck when deploying large-scale DNN training on HDGib clusters: it often occupies a significant portion of each worker's overall training time, in some cases as much as 90%. To improve training performance, a growing body of work aims to reduce this overhead, including local SGD, gradient compression, and communication-computation overlapping approaches.

Based on these previous works, we identify five research issues: determining the communication period for asynchronous local SGD, the unbounded delay problem in asynchronous local SGD, significant gradient identification, gradient compensation for gradient sparsification, and gradient quantization based on the probability distribution. This thesis provides five corresponding solutions. First, we model asynchronous local SGD as a Global Variable Consensus problem and define a trigger, following the trigger definition in Multi-Agent Systems (MASs), to determine when a worker updates its parameters with the server. Second, we compute a weight for each gradient according to the most recent global loss value to address the unbounded delay problem. Third, we identify significant gradients according to the significance of their corresponding parameters, as defined by model interpretability, to prioritize updates for the significant parameters. Fourth, we propose an exponential smoothing prediction approach to compensate gradients and reduce the gradient error introduced by gradient sparsification. Fifth, we provide a layer-wise gradient quantization algorithm based on the gradients' probability distribution to minimize quantization error and accelerate DNN training. Furthermore, we show that when deploying distributed DNN training on HDGib clusters, besides communication overhead, the gradient error caused by the random sampling of training data in distributed SGD and the magnified stale gradient problem also severely affect the convergence of DNN training. We further explore a momentum-driven adaptive synchronization algorithm to constrain the gradient error and address these problems.

Extensive evaluation on HDGib clusters with popular DNN models and training datasets indicates that our proposed algorithms significantly improve communication efficiency and reduce gradient error for distributed DNN training, while also guaranteeing robust convergence to high accuracy for various DNN models.

(Illustrative sketches of several of the techniques summarized above follow the record fields below.)
Degree: Doctor of Philosophy
Subject: Machine learning
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/311674
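
The abstract above describes determining the communication period of asynchronous local SGD with a trigger in the style of Multi-Agent Systems. The exact trigger condition is not given in the abstract, so the following is only a minimal sketch of the general event-triggered idea: a worker runs local SGD steps and contacts the parameter server only when its local parameters have drifted from the last synchronized copy by more than a relative threshold. All names here (ParameterServer, TriggeredWorker, drift_threshold) are hypothetical and not taken from the thesis.

```python
import numpy as np


class ParameterServer:
    """Minimal in-process stand-in for a parameter server (illustration only)."""

    def __init__(self, params):
        self.params = params.astype(float).copy()

    def apply_update(self, delta):
        # Asynchronous push: fold one worker's accumulated local change into the model.
        self.params += delta

    def pull(self):
        return self.params.copy()


class TriggeredWorker:
    """Local-SGD worker that communicates only when an event trigger fires."""

    def __init__(self, params, lr=0.01, drift_threshold=0.05):
        self.params = params.astype(float).copy()  # local model copy
        self.synced = self.params.copy()           # snapshot at last synchronization
        self.lr = lr
        self.drift_threshold = drift_threshold

    def local_step(self, grad):
        """One local SGD step; no communication happens here."""
        self.params -= self.lr * grad

    def trigger_fired(self):
        """Fire when local drift since the last sync exceeds a relative threshold."""
        drift = np.linalg.norm(self.params - self.synced)
        return drift > self.drift_threshold * (np.linalg.norm(self.synced) + 1e-12)

    def maybe_sync(self, server):
        """Push the accumulated local change and pull the latest global model if triggered."""
        if self.trigger_fired():
            server.apply_update(self.params - self.synced)
            self.params = server.pull()
            self.synced = self.params.copy()
```

A training loop would call local_step on each minibatch gradient and then maybe_sync; the relative-drift rule is just one possible trigger definition and stands in for whatever condition the thesis actually derives.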
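
The abstract also mentions compensating the error introduced by gradient sparsification with an exponential smoothing prediction. The concrete scheme is not spelled out in the abstract; the sketch below assumes top-k sparsification and a per-coordinate exponentially smoothed estimate that stands in for the coordinates that are not transmitted. The placement of the smoothing and the parameter alpha are assumptions for illustration, not the thesis's formula.

```python
import numpy as np


def top_k_mask(x, k):
    """Boolean mask selecting the k largest-magnitude entries of a flat array x."""
    idx = np.argpartition(np.abs(x), -k)[-k:]
    mask = np.zeros(x.shape, dtype=bool)
    mask[idx] = True
    return mask


class SmoothedSparsifier:
    """Top-k sparsifier that fills dropped coordinates with an exponentially
    smoothed prediction instead of zeros (illustration only)."""

    def __init__(self, dim, k, alpha=0.3):
        self.smoothed = np.zeros(dim)  # per-coordinate smoothed gradient estimate
        self.k = k
        self.alpha = alpha             # smoothing factor: s_t = alpha*g_t + (1-alpha)*s_{t-1}

    def compress(self, grad):
        """Return a dense stand-in for the compressed gradient: transmitted (top-k)
        entries are exact, dropped entries are replaced by the smoothed prediction."""
        self.smoothed = self.alpha * grad + (1 - self.alpha) * self.smoothed
        mask = top_k_mask(grad, self.k)
        return np.where(mask, grad, self.smoothed)
```

In a real system only the top-k values and their indices would be sent; the receiver would maintain its own copy of the smoothed estimate to fill in the dropped coordinates.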
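
Finally, the layer-wise, distribution-aware gradient quantization can be illustrated with a quantile-based codebook computed separately for each layer; the thesis's actual level-selection rule may differ, so this is purely a sketch under that assumption.

```python
import numpy as np


def quantize_layer(grad, num_levels=8):
    """Quantize one layer's flattened gradient against a codebook placed at the
    empirical quantiles of that layer, so the levels follow its own distribution."""
    levels = np.quantile(grad, np.linspace(0.0, 1.0, num_levels))   # per-layer codebook
    codes = np.abs(grad[:, None] - levels[None, :]).argmin(axis=1)  # nearest level index
    return codes.astype(np.uint8), levels


def dequantize_layer(codes, levels):
    """Recover an approximate gradient from the codes and the layer codebook."""
    return levels[codes]
```

With num_levels = 8, each gradient entry is represented by a 3-bit code plus a small per-layer codebook, which is what would actually be communicated instead of the full-precision gradients.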

 

DC Field / Value
dc.contributor.advisor: Wang, CL
dc.contributor.author: Zhang, Zhaorui
dc.contributor.author: 張兆瑞
dc.date.accessioned: 2022-03-30T05:42:21Z
dc.date.available: 2022-03-30T05:42:21Z
dc.date.issued: 2021
dc.identifier.citation: Zhang, Z. [張兆瑞]. (2021). Efficient parameter update strategy for distributed deep learning system. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/311674
dc.description.abstract: (abstract as given above)
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Machine learning
dc.title: Efficient parameter update strategy for distributed deep learning system
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2022
dc.identifier.mmsid: 991044494006303414
