
Postgraduate thesis: Efficient and flexible parameter server for distributed deep learning

Title: Efficient and flexible parameter server for distributed deep learning
Authors: Yao, Xin [姚信]
Advisors: Wang, CL
Issue Date: 2020
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yao, X. [姚信]. (2020). Efficient and flexible parameter server for distributed deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Deep learning has achieved great success in many fields, including computer vision, natural language processing, autonomous driving, speech recognition, and computer games. Due to the ever-increasing amount of data and increasingly heavy computation workloads, training deep neural networks (DNNs) to high accuracy can take several weeks. To shorten the training time, state-of-the-art techniques build distributed deep learning systems on traditional parameter server architectures. These systems consist of a large number of workers that process the partitioned workloads in parallel and a separate set of servers that store the global model parameters without replicas, while a centralized scheduler controls the synchronization strategy among workers. As cluster scales keep growing, a scalable synchronization controller and a replica-based parameter store become necessary to optimize the performance of distributed training. However, under current relaxed synchronization models, the straggler problem still leads to high synchronization frequency and delayed gradient propagation, and inconsistent parameter replicas that produce stale parameter reads complicate the design of staleness evaluation and replica consistency control. Improving the scalability of real systems and studying distributed deep learning training quantitatively are therefore vitally important.

This thesis proposes a new parameter server architecture, QuanPS, which uses distributed synchronization controllers to reduce synchronization overhead and a multi-master parameter store to provide high-throughput, reliable parameter access. Unlike the centralized scheduler in previous parameter servers, which causes communication bottlenecks and limits flexibility, each synchronization controller on a server independently adjusts the scheme for synchronizing one parameter shard among all workers. With these controllers, the synchronization of different parameter shards can be overlapped to reduce communication time. A lazy pull buffer on each controller delays the execution of some pull requests so that they return more up-to-date parameters, lowering the synchronization frequency. To balance load at runtime, an elastic parameter slicing scheme divides the model parameters into shards by configuring the mapping rules between the original keys and the new keys used when parameters are accessed.

Instead of the naive original data structure, we design a new data type, SumLattice, which extends Conflict-free Replicated Data Types (CRDTs) to store parameters and support gradient aggregation. SumLattice replicas can handle requests concurrently, but they guarantee only eventual consistency because they are synchronized through a background gossip protocol. To access parameters from the replicated store at low latency, a probabilistic consistency guarantee model quantitatively analyzes the trade-off between a parameter's time-varying consistency status and its tolerable response latency. The resulting "dynamic read quorums" approach enables fine-grained consistency control for parameter reads, suppressing the potential harm of stale reads and guaranteeing robust convergence even in the presence of server failures.
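To make the data-structure idea concrete, here is a minimal, hypothetical Python sketch of a mergeable, G-Counter-style aggregate in the spirit of SumLattice. The class and method names (SumLatticeSketch, update, merge, value) are invented for illustration, and the sketch assumes each worker routes its updates through a single home replica; it is not the thesis's actual implementation or API.

```python
# Illustrative sketch only: a CRDT-style "sum lattice" for gradient
# aggregation, modeled on a G-Counter generalized to tensors.
# Assumption: each worker sends its contributions to one home replica.
import numpy as np


class SumLatticeSketch:
    def __init__(self):
        # worker_id -> (version, accumulated gradient contribution)
        self.contrib = {}

    def update(self, worker_id, grad):
        """Apply a worker's local gradient at its home replica."""
        version, acc = self.contrib.get(worker_id, (0, np.zeros_like(grad)))
        self.contrib[worker_id] = (version + 1, acc + grad)

    def merge(self, other):
        """Join with another replica's state (e.g. received via gossip).
        Keeping the higher-versioned entry per worker makes the merge
        commutative, associative, and idempotent."""
        for wid, (ver, acc) in other.contrib.items():
            my_ver, _ = self.contrib.get(wid, (0, None))
            if ver > my_ver:
                self.contrib[wid] = (ver, acc)

    def value(self):
        """Aggregated gradient: sum of the latest contribution per worker."""
        return sum(acc for _, acc in self.contrib.values())


# Example: two replicas receiving different workers' updates converge
# to the same aggregate after exchanging states via gossip.
a, b = SumLatticeSketch(), SumLatticeSketch()
a.update("w0", np.array([1.0, 2.0]))
b.update("w1", np.array([0.5, 0.5]))
a.merge(b)
b.merge(a)
assert np.allclose(a.value(), b.value())  # [1.5, 2.5] on both replicas
```

Because the merge is a join (a per-worker maximum on versions), gossip messages can be lost, reordered, or duplicated without breaking eventual convergence of the aggregated value.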
Our experimental results show that the proposed framework significantly reduces communication overhead and improves the scalability of distributed deep learning systems, while also guaranteeing robust convergence to high accuracy.
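Similarly, the following hypothetical Python sketch illustrates the kind of trade-off a probabilistic consistency model can expose: choosing the smallest "dynamic read quorum" so that at least one queried replica is likely to hold a fresh-enough parameter. The independence assumption, the freshness estimates, and the function name are illustrative only, not the thesis's actual model.

```python
# Illustrative sketch only: picking a "dynamic read quorum" size from a
# probabilistic freshness model, assuming independent replicas.

def read_quorum_size(p_fresh, target=0.99):
    """Smallest number of replicas to read so that at least one of them
    returns a sufficiently fresh value with probability >= target.

    p_fresh: per-replica probability of holding a fresh-enough copy,
             e.g. estimated from gossip propagation delay versus the
             read's latency budget.
    """
    probs = sorted(p_fresh, reverse=True)  # query likely-fresh replicas first
    p_all_stale = 1.0
    for size, p in enumerate(probs, start=1):
        p_all_stale *= (1.0 - p)
        if 1.0 - p_all_stale >= target:
            return size
    return len(probs)  # fall back to reading every replica


# Example: with uneven freshness estimates, a 0.99 guarantee needs all
# three replicas, while a looser 0.95 target is met by a single read.
print(read_quorum_size([0.95, 0.7, 0.5]))               # -> 3
print(read_quorum_size([0.95, 0.7, 0.5], target=0.95))  # -> 1
```

A tighter freshness target forces reads from more replicas (higher latency), while a looser target permits faster reads from fewer replicas, mirroring the consistency-latency trade-off described in the abstract.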
Degree: Doctor of Philosophy
Subject: Machine learning
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/295608

DC Field | Value | Language
dc.contributor.advisor | Wang, CL | -
dc.contributor.author | Yao, Xin | -
dc.contributor.author | 姚信 | -
dc.date.accessioned | 2021-02-02T03:05:16Z | -
dc.date.available | 2021-02-02T03:05:16Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | Yao, X. [姚信]. (2020). Efficient and flexible parameter server for distributed deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/295608 | -
dc.description.abstract | (identical to the Abstract above) | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Machine learning | -
dc.title | Efficient and flexible parameter server for distributed deep learning | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2021 | -
dc.identifier.mmsid | 991044340098503414 | -
