Appears in Collections: Postgraduate thesis: Exploiting characteristics of data parallelism for efficient distributed machine learning systems
Title | Exploiting characteristics of data parallelism for efficient distributed machine learning systems |
---|---|
Authors | Chen, Yangrui (陈扬锐) |
Advisors | Wu, C |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Chen, Y. [陈扬锐]. (2023). Exploiting characteristics of data parallelism for efficient distributed machine learning systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Deep Neural Networks (DNNs) have achieved ground-breaking performance in a wide range of domains, such as computer vision, natural language processing, and recommendation. Meanwhile, model sizes and data volumes have grown exponentially, making DNN training time-consuming and resource-intensive. Data parallelism, which scales DNN training across multiple machines, is widely adopted for accelerating distributed deep learning. Unfortunately, it often cannot fully utilize computation resources, for reasons such as communication overhead, resource contention, and long data preprocessing. This thesis demonstrates that there is great potential for accelerating distributed machine learning by exploiting the characteristics of DNN training. Four system designs that address challenges in building efficient and performant DNN training are introduced in this thesis: PSLD, SAPipe, BGL, and SP-GNN. PSLD is a dynamic parameter server (PS) load distribution scheme that mitigates PS straggler issues and accelerates distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale parameter servers in and out and to adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module that carries out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD for different ML models under various straggler settings. SAPipe is a performant, staleness-aware communication pipeline system that pushes the training speed of data parallelism to its fullest extent. By introducing partial staleness, SAPipe overlaps communication with computation while incurring minimal staleness. To mitigate the problems that staleness incurs, SAPipe adopts staleness compensation techniques, including weight prediction and delay compensation, with provably lower error bounds. Additionally, SAPipe presents an algorithm-system co-design with runtime optimization to minimize the system overhead of the staleness training pipeline and staleness compensation. SAPipe achieves up to 157% speedups over BytePS (non-stale) and outperforms PipeSGD in accuracy by up to 13.7%. Graph neural networks (GNNs) extend the success of DNNs to non-Euclidean graph data, but existing systems are inefficient at training large graphs. BGL is a distributed GNN training system designed to address GNN training bottlenecks with a few key ideas. First, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Second, a static cache engine minimizes feature-retrieval traffic. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 1.9x on average. We also explore the expressive power of GNNs and design SP-GNN, a new class of GNNs offering generic and enhanced expressive power on graph data. SP-GNN enhances the expressive power of GNN architectures by incorporating a near-isometric proximity-aware position encoder and a scalable structure encoder. Our experiments with SP-GNN show significant improvements in classification over existing GNN models on various graph datasets. |
Degree | Doctor of Philosophy |
Subject | Machine learning; Parallel programming (Computer science) |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/328945 |
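The staleness compensation techniques the abstract attributes to SAPipe (weight prediction and delay compensation) can be illustrated with a minimal sketch. The function names, the momentum-based extrapolation, and the `lam` constant below are illustrative assumptions in the spirit of the described techniques, not SAPipe's actual API or implementation:

```python
def predict_weights(w, momentum, lr, staleness):
    """Weight prediction: extrapolate parameters `staleness` steps ahead,
    treating the current momentum as an estimate of each upcoming update."""
    return [wi - lr * staleness * mi for wi, mi in zip(w, momentum)]

def delay_compensated_grad(grad, w_now, w_stale, lam=0.04):
    """Delay compensation: correct a gradient computed at stale weights with
    a first-order term, using g*g as a diagonal Hessian approximation."""
    return [g + lam * g * g * (wn - ws)
            for g, wn, ws in zip(grad, w_now, w_stale)]

# e.g. predict_weights([1.0] * 3, [0.5] * 3, lr=0.1, staleness=2)
# extrapolates each weight by lr * staleness * momentum = 0.1, giving ~0.9
```

Both corrections let communication of stale gradients overlap with computation: the worker either anticipates where the weights will be, or repairs the gradient after the fact.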
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Wu, C | - |
dc.contributor.author | Chen, Yangrui | - |
dc.contributor.author | 陈扬锐 | - |
dc.date.accessioned | 2023-08-01T06:48:30Z | - |
dc.date.available | 2023-08-01T06:48:30Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Chen, Y. [陈扬锐]. (2023). Exploiting characteristics of data parallelism for efficient distributed machine learning systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/328945 | - |
dc.description.abstract | Deep Neural Networks (DNNs) have achieved ground-breaking performance in a wide range of domains, such as computer vision, natural language processing, and recommendation. Meanwhile, model sizes and data volumes have grown exponentially, making DNN training time-consuming and resource-intensive. Data parallelism, which scales DNN training across multiple machines, is widely adopted for accelerating distributed deep learning. Unfortunately, it often cannot fully utilize computation resources, for reasons such as communication overhead, resource contention, and long data preprocessing. This thesis demonstrates that there is great potential for accelerating distributed machine learning by exploiting the characteristics of DNN training. Four system designs that address challenges in building efficient and performant DNN training are introduced in this thesis: PSLD, SAPipe, BGL, and SP-GNN. PSLD is a dynamic parameter server (PS) load distribution scheme that mitigates PS straggler issues and accelerates distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale parameter servers in and out and to adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module that carries out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD for different ML models under various straggler settings. SAPipe is a performant, staleness-aware communication pipeline system that pushes the training speed of data parallelism to its fullest extent. By introducing partial staleness, SAPipe overlaps communication with computation while incurring minimal staleness. To mitigate the problems that staleness incurs, SAPipe adopts staleness compensation techniques, including weight prediction and delay compensation, with provably lower error bounds. Additionally, SAPipe presents an algorithm-system co-design with runtime optimization to minimize the system overhead of the staleness training pipeline and staleness compensation. SAPipe achieves up to 157% speedups over BytePS (non-stale) and outperforms PipeSGD in accuracy by up to 13.7%. Graph neural networks (GNNs) extend the success of DNNs to non-Euclidean graph data, but existing systems are inefficient at training large graphs. BGL is a distributed GNN training system designed to address GNN training bottlenecks with a few key ideas. First, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Second, a static cache engine minimizes feature-retrieval traffic. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 1.9x on average. We also explore the expressive power of GNNs and design SP-GNN, a new class of GNNs offering generic and enhanced expressive power on graph data. SP-GNN enhances the expressive power of GNN architectures by incorporating a near-isometric proximity-aware position encoder and a scalable structure encoder. Our experiments with SP-GNN show significant improvements in classification over existing GNN models on various graph datasets. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Machine learning | - |
dc.subject.lcsh | Parallel programming (Computer science) | - |
dc.title | Exploiting characteristics of data parallelism for efficient distributed machine learning systems | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2023 | - |
dc.identifier.mmsid | 991044705909403414 | - |