
Postgraduate thesis: Exploiting characteristics of data parallelism for efficient distributed machine learning systems

Title: Exploiting characteristics of data parallelism for efficient distributed machine learning systems
Authors: Chen, Yangrui [陈扬锐]
Advisor(s): Wu, C
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Chen, Y. [陈扬锐]. (2023). Exploiting characteristics of data parallelism for efficient distributed machine learning systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Deep Neural Networks (DNNs) have achieved ground-breaking performance in a wide range of domains, such as computer vision, natural language processing, and recommendation. Meanwhile, model sizes and data volumes have grown exponentially, making DNN training time-consuming and resource-intensive. Data parallelism, which scales DNN training across multiple machines, is widely adopted for accelerating distributed deep learning. Unfortunately, it often cannot fully utilize computation resources, due to, e.g., communication overhead, resource contention, and long data preprocessing. This thesis demonstrates that there is great potential for accelerating distributed machine learning by exploiting the characteristics of DNN training. Four system designs that address challenges in building efficient and performant DNN training are introduced in this thesis: PSLD, SAPipe, BGL, and SP-GNN. PSLD is a dynamic parameter server (PS) load distribution scheme that mitigates PS straggler issues and accelerates distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale parameter servers in and out and adjust the parameter distribution among PSs on the go. We also design an elastic PS scaling module to carry out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to a 2.86x speed-up in model training with PSLD, for different ML models under various straggler settings. SAPipe is a performant, staleness-aware communication pipeline system that pushes the training speed of data parallelism to its full extent. By introducing partial staleness, SAPipe overlaps communication with computation while incurring minimal staleness. To mitigate the problems introduced by staleness, SAPipe adopts staleness compensation techniques, including weight prediction and delay compensation, with provably lower error bounds. Additionally, SAPipe presents an algorithm-system co-design with runtime optimization to minimize the system overhead of the staleness training pipeline and staleness compensation. SAPipe achieves up to 157% speedup over BytePS (non-stale) and outperforms PipeSGD in accuracy by up to 13.7%. Graph neural networks (GNNs) extend the success of DNNs to non-Euclidean graph data, but existing systems are inefficient at training large graphs. BGL is a distributed GNN training system designed to address the GNN training bottlenecks with a few key ideas. First, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Second, a static cache engine is used to minimize feature-retrieval traffic. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 1.9x on average. We also explore the expressive power of GNNs and design SP-GNN, a new class of GNNs offering generic and enhanced expressive power on graph data. SP-GNN enhances the expressive power of GNN architectures by incorporating a near-isometric proximity-aware position encoder and a scalable structure encoder. Our experiments show that SP-GNN achieves significant improvements in classification over existing GNN models on various graph datasets.
Degree: Doctor of Philosophy
Subject: Machine learning; Parallel programming (Computer science)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/328945
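The abstract above attributes SAPipe's ability to tolerate partial staleness to two compensation techniques, weight prediction and delay compensation. Below is a minimal, hedged sketch of how such compensation could be applied to a gradient that arrives a few steps late; the function names, the hyperparameters tau and lam, the momentum-based prediction, and the toy quadratic loss are illustrative assumptions, not SAPipe's actual implementation.

```python
# Hedged sketch: compensating a gradient that arrives 'tau' steps late, in the
# spirit of the weight-prediction and delay-compensation techniques the abstract
# attributes to SAPipe. All names here are illustrative, not SAPipe's API.
import numpy as np

def predict_weights(w, velocity, lr, tau):
    """Extrapolate the weights 'tau' steps ahead along the current momentum
    direction, so the gradient is computed on (approximately) the weights the
    update will eventually be applied to."""
    return w - lr * tau * velocity

def compensate_gradient(g_stale, w_now, w_stale, lam=0.04):
    """First-order delay compensation: approximate the Hessian diagonal by the
    element-wise product g*g and correct the stale gradient toward the gradient
    at the current weights."""
    return g_stale + lam * g_stale * g_stale * (w_now - w_stale)

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient at w is simply w.
rng = np.random.default_rng(0)
lr, tau = 0.1, 2
w = rng.normal(size=4)                  # current (up-to-date) weights
velocity = 0.1 * rng.normal(size=4)     # running momentum estimate
w_pred = predict_weights(w, velocity, lr, tau)
g_stale = w_pred                        # gradient computed on predicted/stale weights
g = compensate_gradient(g_stale, w_now=w, w_stale=w_pred)
w -= lr * g                             # apply the compensated update
```

The compensation term follows the general first-order form used in delay-compensated asynchronous SGD; whether SAPipe uses this exact form, or different prediction and compensation rules, is not specified in the abstract.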

 

DC Field | Value | Language
dc.contributor.advisor | Wu, C | -
dc.contributor.author | Chen, Yangrui | -
dc.contributor.author | 陈扬锐 | -
dc.date.accessioned | 2023-08-01T06:48:30Z | -
dc.date.available | 2023-08-01T06:48:30Z | -
dc.date.issued | 2023 | -
dc.identifier.citation | Chen, Y. [陈扬锐]. (2023). Exploiting characteristics of data parallelism for efficient distributed machine learning systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/328945 | -
dc.description.abstract | Deep Neural Networks (DNNs) have achieved ground-breaking performance in a wide range of domains, such as computer vision, natural language processing, and recommendation. Meanwhile, model sizes and data volumes have grown exponentially, making DNN training time-consuming and resource-intensive. Data parallelism, which scales DNN training across multiple machines, is widely adopted for accelerating distributed deep learning. Unfortunately, it often cannot fully utilize computation resources, due to, e.g., communication overhead, resource contention, and long data preprocessing. This thesis demonstrates that there is great potential for accelerating distributed machine learning by exploiting the characteristics of DNN training. Four system designs that address challenges in building efficient and performant DNN training are introduced in this thesis: PSLD, SAPipe, BGL, and SP-GNN. PSLD is a dynamic parameter server (PS) load distribution scheme that mitigates PS straggler issues and accelerates distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale parameter servers in and out and adjust the parameter distribution among PSs on the go. We also design an elastic PS scaling module to carry out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to a 2.86x speed-up in model training with PSLD, for different ML models under various straggler settings. SAPipe is a performant, staleness-aware communication pipeline system that pushes the training speed of data parallelism to its full extent. By introducing partial staleness, SAPipe overlaps communication with computation while incurring minimal staleness. To mitigate the problems introduced by staleness, SAPipe adopts staleness compensation techniques, including weight prediction and delay compensation, with provably lower error bounds. Additionally, SAPipe presents an algorithm-system co-design with runtime optimization to minimize the system overhead of the staleness training pipeline and staleness compensation. SAPipe achieves up to 157% speedup over BytePS (non-stale) and outperforms PipeSGD in accuracy by up to 13.7%. Graph neural networks (GNNs) extend the success of DNNs to non-Euclidean graph data, but existing systems are inefficient at training large graphs. BGL is a distributed GNN training system designed to address the GNN training bottlenecks with a few key ideas. First, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Second, a static cache engine is used to minimize feature-retrieval traffic. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 1.9x on average. We also explore the expressive power of GNNs and design SP-GNN, a new class of GNNs offering generic and enhanced expressive power on graph data. SP-GNN enhances the expressive power of GNN architectures by incorporating a near-isometric proximity-aware position encoder and a scalable structure encoder. Our experiments show that SP-GNN achieves significant improvements in classification over existing GNN models on various graph datasets. | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Machine learning | -
dc.subject.lcsh | Parallel programming (Computer science) | -
dc.title | Exploiting characteristics of data parallelism for efficient distributed machine learning systems | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2023 | -
dc.identifier.mmsid | 991044705909403414 | -
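
The abstract (repeated in the full record above) notes that BGL uses a static cache engine to minimize feature-retrieval traffic during distributed GNN training. The sketch below illustrates one simple static caching policy, keeping the features of the highest-degree nodes resident locally and fetching the rest from a remote store; the class name StaticFeatureCache, the degree-based policy, and the fetch_remote callback are assumptions for illustration, not BGL's actual design.

```python
# Hedged sketch of a static feature cache in the spirit of BGL's cache engine:
# keep features of the highest-degree nodes locally and only fetch the rest
# over the network. Names and policy here are illustrative assumptions.
import numpy as np

class StaticFeatureCache:
    def __init__(self, degrees, features, capacity):
        # Cache the 'capacity' nodes with the largest degree: high-degree nodes
        # tend to appear in sampled subgraphs most often, so a static policy
        # already absorbs much of the feature-retrieval traffic.
        hot = np.argsort(degrees)[::-1][:capacity]
        self.cached = {int(n): features[n] for n in hot}

    def gather(self, node_ids, fetch_remote):
        """Return features for node_ids, serving cache hits locally and calling
        'fetch_remote' (e.g., an RPC to the feature store) for the misses."""
        missing = [n for n in node_ids if n not in self.cached]
        remote = fetch_remote(missing) if missing else {}
        return np.stack([self.cached.get(n, remote.get(n)) for n in node_ids])

# Toy usage with an in-memory stand-in for the remote feature store.
num_nodes, dim = 1000, 16
features = np.random.rand(num_nodes, dim).astype(np.float32)
degrees = np.random.poisson(5, size=num_nodes)
cache = StaticFeatureCache(degrees, features, capacity=100)
batch = np.random.choice(num_nodes, size=32, replace=False).tolist()
out = cache.gather(batch, lambda ids: {n: features[n] for n in ids})
assert out.shape == (32, dim)
```

A degree-based static policy is only one option; BGL's own caching policy and its interaction with graph partitioning may differ, and are described in the thesis rather than in this record.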
