
Postgraduate thesis: Efficient and flexible parameter server for distributed deep learning

Title: Efficient and flexible parameter server for distributed deep learning
Authors: Yao, Xin [姚信]
Advisors: Wang, CL
Issue Date: 2020
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yao, X. [姚信]. (2020). Efficient and flexible parameter server for distributed deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Deep learning has achieved great success in many fields, including computer vision, natural language processing, autonomous driving, speech recognition, and computer games. Due to the ever-increasing amount of data and increasingly heavy computation workloads, training deep neural networks (DNNs) to high accuracy can take several weeks. To shorten the training time, state-of-the-art techniques build distributed deep learning systems on traditional parameter server architectures. These systems consist of a large number of workers that process the partitioned workloads in parallel and a separate set of servers that store the global model parameters without replicas, while a centralized scheduler controls the synchronization strategy among workers. As cluster scales keep growing, a scalable synchronization controller and a replica-based parameter store become necessary to optimize the performance of distributed training. However, under current relaxed synchronization models, the straggler problem still leads to high synchronization frequency and delayed gradient propagation, and inconsistent parameter replicas that produce stale parameter reads complicate the design of staleness evaluation and replica consistency control. Improving the scalability of real systems and studying distributed deep learning training quantitatively are therefore vitally important.

This thesis proposes a new parameter server architecture, QuanPS, which uses distributed synchronization controllers to reduce synchronization overhead and a multi-master parameter store to provide high-throughput, reliable parameter access. Unlike the centralized scheduler in previous parameter servers, which causes communication bottlenecks and limits flexibility, each synchronization controller on a server independently adjusts the scheme for synchronizing one parameter shard among all workers. With these controllers, the synchronization of different parameter shards can be overlapped to reduce communication time. A lazy pull buffer on each controller delays the execution of some pull requests so that they return more up-to-date parameters, lowering the synchronization frequency. To balance load at runtime, an elastic parameter slicing scheme divides the model parameters into shards by configuring the mapping rules between the original keys and the new keys used when parameters are accessed.

Instead of the naive original data structure, we design a new data type, SumLattice, which extends Conflict-free Replicated Data Types (CRDTs) to store parameters and support gradient aggregation. SumLattice replicas can handle requests concurrently, but they guarantee only eventual consistency because they are synchronized through a background gossip protocol. To access parameters from the replicated store at low latency, a probabilistic consistency guarantee model quantitatively analyzes the trade-off between a parameter's time-varying consistency status and its tolerable response latency. The resulting "dynamic read quorums" approach enables fine-grained consistency control for parameter reads, suppressing the potential harm of stale reads and guaranteeing robust convergence even in the presence of server failures.
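To make the data-structure idea concrete, here is a minimal, hypothetical Python sketch of a mergeable, G-Counter-style aggregate in the spirit of SumLattice. The class and method names (SumLatticeSketch, update, merge, value) are invented for illustration, and the sketch assumes each worker routes its updates through a single home replica; it is not the thesis's actual implementation or API.

```python
# Illustrative sketch only: a CRDT-style "sum lattice" for gradient
# aggregation, modeled on a G-Counter generalized to tensors.
# Assumption: each worker sends its contributions to one home replica.
import numpy as np


class SumLatticeSketch:
    def __init__(self):
        # worker_id -> (version, accumulated gradient contribution)
        self.contrib = {}

    def update(self, worker_id, grad):
        """Apply a worker's local gradient at its home replica."""
        version, acc = self.contrib.get(worker_id, (0, np.zeros_like(grad)))
        self.contrib[worker_id] = (version + 1, acc + grad)

    def merge(self, other):
        """Join with another replica's state (e.g. received via gossip).
        Keeping the higher-versioned entry per worker makes the merge
        commutative, associative, and idempotent."""
        for wid, (ver, acc) in other.contrib.items():
            my_ver, _ = self.contrib.get(wid, (0, None))
            if ver > my_ver:
                self.contrib[wid] = (ver, acc)

    def value(self):
        """Aggregated gradient: sum of the latest contribution per worker."""
        return sum(acc for _, acc in self.contrib.values())


# Example: two replicas receiving different workers' updates converge
# to the same aggregate after exchanging states via gossip.
a, b = SumLatticeSketch(), SumLatticeSketch()
a.update("w0", np.array([1.0, 2.0]))
b.update("w1", np.array([0.5, 0.5]))
a.merge(b)
b.merge(a)
assert np.allclose(a.value(), b.value())  # [1.5, 2.5] on both replicas
```

Because the merge is a join (a per-worker maximum on versions), gossip messages can be lost, reordered, or duplicated without breaking eventual convergence of the aggregated value.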
Our experimental results show that the proposed framework significantly reduces communication overhead and improves the scalability of distributed deep learning systems, while also guaranteeing robust convergence to high accuracy.
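Similarly, the following hypothetical Python sketch illustrates the kind of trade-off a probabilistic consistency model can expose: choosing the smallest "dynamic read quorum" so that at least one queried replica is likely to hold a fresh-enough parameter. The independence assumption, the freshness estimates, and the function name are illustrative only, not the thesis's actual model.

```python
# Illustrative sketch only: picking a "dynamic read quorum" size from a
# probabilistic freshness model, assuming independent replicas.

def read_quorum_size(p_fresh, target=0.99):
    """Smallest number of replicas to read so that at least one of them
    returns a sufficiently fresh value with probability >= target.

    p_fresh: per-replica probability of holding a fresh-enough copy,
             e.g. estimated from gossip propagation delay versus the
             read's latency budget.
    """
    probs = sorted(p_fresh, reverse=True)  # query likely-fresh replicas first
    p_all_stale = 1.0
    for size, p in enumerate(probs, start=1):
        p_all_stale *= (1.0 - p)
        if 1.0 - p_all_stale >= target:
            return size
    return len(probs)  # fall back to reading every replica


# Example: with uneven freshness estimates, a 0.99 guarantee needs all
# three replicas, while a looser 0.95 target is met by a single read.
print(read_quorum_size([0.95, 0.7, 0.5]))               # -> 3
print(read_quorum_size([0.95, 0.7, 0.5], target=0.95))  # -> 1
```

A tighter freshness target forces reads from more replicas (higher latency), while a looser target permits faster reads from fewer replicas, mirroring the consistency-latency trade-off described in the abstract.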
Degree: Doctor of Philosophy
Subject: Machine learning
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/295608

DC Field | Value | Language
dc.contributor.advisor | Wang, CL | -
dc.contributor.author | Yao, Xin | -
dc.contributor.author | 姚信 | -
dc.date.accessioned | 2021-02-02T03:05:16Z | -
dc.date.available | 2021-02-02T03:05:16Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | Yao, X. [姚信]. (2020). Efficient and flexible parameter server for distributed deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/295608 | -
dc.description.abstract | (identical to the Abstract above) | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Machine learning | -
dc.title | Efficient and flexible parameter server for distributed deep learning | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2021 | -
dc.identifier.mmsid | 991044340098503414 | -
