Appears in Collections: Postgraduate thesis: Exploiting characteristics of data parallelism for efficient distributed machine learning systems
Title | Exploiting characteristics of data parallelism for efficient distributed machine learning systems |
---|---|
Authors | Chen, Yangrui (陈扬锐) |
Advisors | Wu, C |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Chen, Y. [陈扬锐]. (2023). Exploiting characteristics of data parallelism for efficient distributed machine learning systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Deep Neural Networks (DNNs) have achieved ground-breaking performance in a wide range of domains, such as computer vision, natural language processing, and recommendation. Meanwhile, model sizes and data volumes have grown exponentially, making DNN training time-consuming and resource-intensive. Data parallelism, which scales DNN training across multiple machines, is widely adopted for accelerating distributed deep learning. Unfortunately, it often cannot fully utilize computation resources, for reasons such as communication overhead, resource contention, and long data preprocessing. This thesis demonstrates that there is great potential for accelerating distributed machine learning by exploiting the characteristics of DNN training. Four system designs that address challenges in building efficient and performant DNN training are introduced in this thesis: PSLD, SAPipe, BGL, and SP-GNN. PSLD is a dynamic parameter server (PS) load distribution scheme that mitigates PS straggler issues and accelerates distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale parameter servers in and out and to adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module that carries out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD for different ML models under various straggler settings. SAPipe is a performant, staleness-aware communication pipeline system that pushes the training speed of data parallelism to its fullest extent. By introducing partial staleness, SAPipe overlaps communication with computation while incurring minimal staleness. To mitigate the problems that staleness incurs, SAPipe adopts staleness compensation techniques, including weight prediction and delay compensation, with provably lower error bounds. Additionally, SAPipe presents an algorithm-system co-design with runtime optimization to minimize the system overhead of the staleness training pipeline and staleness compensation. SAPipe achieves up to 157% speedups over BytePS (non-stale) and outperforms PipeSGD in accuracy by up to 13.7%. Graph neural networks (GNNs) extend the success of DNNs to non-Euclidean graph data, but existing systems are inefficient at training large graphs. BGL is a distributed GNN training system designed to address GNN training bottlenecks with a few key ideas. First, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Second, a static cache engine minimizes feature-retrieval traffic. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 1.9x on average. We also explore the expressive power of GNNs and design SP-GNN, a new class of GNNs offering generic and enhanced expressive power on graph data. SP-GNN enhances the expressive power of GNN architectures by incorporating a near-isometric proximity-aware position encoder and a scalable structure encoder. Our experiments with SP-GNN show significant improvements in classification over existing GNN models on various graph datasets. |
Degree | Doctor of Philosophy |
Subject | Machine learning; Parallel programming (Computer science) |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/328945 |
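The staleness compensation techniques the abstract attributes to SAPipe (weight prediction and delay compensation) can be illustrated with a minimal sketch. The function names, the momentum-based extrapolation, and the `lam` constant below are illustrative assumptions in the spirit of the described techniques, not SAPipe's actual API or implementation:

```python
def predict_weights(w, momentum, lr, staleness):
    """Weight prediction: extrapolate parameters `staleness` steps ahead,
    treating the current momentum as an estimate of each upcoming update."""
    return [wi - lr * staleness * mi for wi, mi in zip(w, momentum)]

def delay_compensated_grad(grad, w_now, w_stale, lam=0.04):
    """Delay compensation: correct a gradient computed at stale weights with
    a first-order term, using g*g as a diagonal Hessian approximation."""
    return [g + lam * g * g * (wn - ws)
            for g, wn, ws in zip(grad, w_now, w_stale)]

# e.g. predict_weights([1.0] * 3, [0.5] * 3, lr=0.1, staleness=2)
# extrapolates each weight by lr * staleness * momentum = 0.1, giving ~0.9
```

Both corrections let communication of stale gradients overlap with computation: the worker either anticipates where the weights will be, or repairs the gradient after the fact.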
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Wu, C | - |
dc.contributor.author | Chen, Yangrui | - |
dc.contributor.author | 陈扬锐 | - |
dc.date.accessioned | 2023-08-01T06:48:30Z | - |
dc.date.available | 2023-08-01T06:48:30Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Chen, Y. [陈扬锐]. (2023). Exploiting characteristics of data parallelism for efficient distributed machine learning systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/328945 | - |
dc.description.abstract | Deep Neural Networks (DNNs) have achieved ground-breaking performance in a wide range of domains, such as computer vision, natural language processing, and recommendation. Meanwhile, model sizes and data volumes have grown exponentially, making DNN training time-consuming and resource-intensive. Data parallelism, which scales DNN training across multiple machines, is widely adopted for accelerating distributed deep learning. Unfortunately, it often cannot fully utilize computation resources, for reasons such as communication overhead, resource contention, and long data preprocessing. This thesis demonstrates that there is great potential for accelerating distributed machine learning by exploiting the characteristics of DNN training. Four system designs that address challenges in building efficient and performant DNN training are introduced in this thesis: PSLD, SAPipe, BGL, and SP-GNN. PSLD is a dynamic parameter server (PS) load distribution scheme that mitigates PS straggler issues and accelerates distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale parameter servers in and out and to adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module that carries out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD for different ML models under various straggler settings. SAPipe is a performant, staleness-aware communication pipeline system that pushes the training speed of data parallelism to its fullest extent. By introducing partial staleness, SAPipe overlaps communication with computation while incurring minimal staleness. To mitigate the problems that staleness incurs, SAPipe adopts staleness compensation techniques, including weight prediction and delay compensation, with provably lower error bounds. Additionally, SAPipe presents an algorithm-system co-design with runtime optimization to minimize the system overhead of the staleness training pipeline and staleness compensation. SAPipe achieves up to 157% speedups over BytePS (non-stale) and outperforms PipeSGD in accuracy by up to 13.7%. Graph neural networks (GNNs) extend the success of DNNs to non-Euclidean graph data, but existing systems are inefficient at training large graphs. BGL is a distributed GNN training system designed to address GNN training bottlenecks with a few key ideas. First, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Second, a static cache engine minimizes feature-retrieval traffic. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 1.9x on average. We also explore the expressive power of GNNs and design SP-GNN, a new class of GNNs offering generic and enhanced expressive power on graph data. SP-GNN enhances the expressive power of GNN architectures by incorporating a near-isometric proximity-aware position encoder and a scalable structure encoder. Our experiments with SP-GNN show significant improvements in classification over existing GNN models on various graph datasets. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Machine learning | - |
dc.subject.lcsh | Parallel programming (Computer science) | - |
dc.title | Exploiting characteristics of data parallelism for efficient distributed machine learning systems | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2023 | - |
dc.identifier.mmsid | 991044705909403414 | - |