
Postgraduate thesis: Automated optimization of distributed tensor programs for large DNN training and inference

Title: Automated optimization of distributed tensor programs for large DNN training and inference
Authors: Zhang, Shiwei (張實唯)
Issue Date: 2025
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Zhang, S. [張實唯]. (2025). Automated optimization of distributed tensor programs for large DNN training and inference. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Modern large-scale machine learning relies heavily on complex computations expressed as tensor programs, typically distributed across diverse hardware clusters. Optimizing these programs for performance presents significant challenges due to model scale, hardware heterogeneity, intricate parallelism strategies, and dynamic execution patterns. This thesis focuses on automated optimization of distributed tensor programs, developing systems and methodologies to automatically enhance the performance of both training and inference for large models. First, we introduce TAG for optimizing DNN training, viewing the process as deploying a tensor program onto heterogeneous devices. TAG employs a graph neural network (GNN) that processes both the computation graph (tensor operations) and the device topology, coupled with Monte-Carlo tree search, to automatically derive optimized distributed execution plans. It further automates communication optimization via sufficient factor broadcasting. TAG demonstrates up to 4.56x speed-up, showcasing effective automated optimization for heterogeneous training programs. Second, addressing performance bottlenecks in SPMD parallelism for trillion-parameter models, we present HiDup, which optimizes distributed tensor programs through computation-communication overlapping. We propose interleaving the execution of two microbatches, so that the computation of one microbatch overlaps the communication of the other. A dynamic programming algorithm automates the search for optimal tensor sharding strategies within this restructured program, maximizing overlap and achieving up to 61% training speed-up. Third, we develop HAP to further accelerate SPMD tensor programs on heterogeneous clusters. We formulate the partitioning of the tensor program as a program synthesis problem, automatically generating an optimized distributed program from a single-device program using A*-based search. HAP automatically co-optimizes the tensor sharding strategy, device-specific sharding ratios (via linear programming), and efficient tensor communication methods. HAP achieves up to 2.41x speed-up, demonstrating effective automated optimization for SPMD programs on heterogeneous hardware. Finally, we tackle the automated optimization of dynamic workflows composed of multiple tensor programs. Conditional execution in these workflows hinders static optimization. We introduce DyOrc, a workflow serving system that deploys components as independently scalable services. DyOrc features: (i) speculative scheduling for effective batching of tensor computations; (ii) multi-tier message passing for efficient inter-program tensor communication; and (iii) proactive loading to hide startup latency. DyOrc improves serving latency by 4–198% on diverse dynamic workloads.
Degree: Doctor of Philosophy
Subject: Machine learning; Deep learning (Machine learning); Parallel processing (Electronic computers); Program transformation (Computer programming)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/363981
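
The abstract describes HiDup's core idea as interleaving two microbatches so that the computation of one overlaps the communication of the other. The fragment below is a minimal, hypothetical sketch of that interleaving pattern, not code from the thesis: it assumes PyTorch with an initialized NCCL process group and stands in for "communication" with an asynchronous all-reduce of each layer's output. All names are illustrative.

    import torch
    import torch.distributed as dist

    def interleaved_forward(layers, micro_a, micro_b):
        """Push two microbatches through `layers`, issuing each collective
        asynchronously so that it overlaps the other microbatch's compute."""
        acts = [micro_a, micro_b]
        pending = None  # in-flight all-reduce belonging to the other microbatch
        for layer in layers:
            for i in (0, 1):
                out = layer(acts[i])       # compute the current microbatch
                if pending is not None:
                    pending.wait()         # the other collective overlapped this compute
                pending = dist.all_reduce(out, async_op=True)  # overlaps the next compute
                acts[i] = out
        if pending is not None:
            pending.wait()
        return acts

The two-microbatch restructuring guarantees there is always independent computation available to hide each collective behind; per the abstract, HiDup's dynamic program then picks the shardings that maximize how much communication can actually be hidden.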
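
That dynamic program is described in the abstract only as a search over tensor sharding strategies. As a rough, hypothetical illustration of how such a chain-structured search can work (with a made-up cost model, not the thesis's), one can minimize total compute plus resharding cost layer by layer:

    def best_sharding(compute_cost, reshard_cost):
        """compute_cost[l][s]: cost of running layer l under sharding s.
        reshard_cost[l][p][s]: cost of converting layer l-1's output from
        sharding p to the sharding s expected by layer l.
        Returns (total cost, chosen sharding per layer). Illustrative only."""
        L, S = len(compute_cost), len(compute_cost[0])
        INF = float("inf")
        dp = [[INF] * S for _ in range(L)]   # dp[l][s]: best cost of layers 0..l ending in sharding s
        choice = [[0] * S for _ in range(L)]
        dp[0] = list(compute_cost[0])
        for l in range(1, L):
            for s in range(S):
                for p in range(S):
                    cand = dp[l - 1][p] + reshard_cost[l][p][s] + compute_cost[l][s]
                    if cand < dp[l][s]:
                        dp[l][s], choice[l][s] = cand, p
        s = min(range(S), key=lambda k: dp[L - 1][k])   # best final sharding
        plan = [s]
        for l in range(L - 1, 0, -1):                   # backtrack the decisions
            s = choice[l][s]
            plan.append(s)
        plan.reverse()
        return dp[L - 1][plan[-1]], plan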
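
For HAP, the abstract states that device-specific sharding ratios on a heterogeneous cluster are chosen via linear programming. One plausible formulation, assumed here rather than taken from the thesis, balances per-device compute time by minimizing the makespan; with SciPy it can be written as:

    import numpy as np
    from scipy.optimize import linprog

    def sharding_ratios(per_sample_time):
        """per_sample_time[d]: seconds per training sample on device d (assumed known).
        Returns the fraction of each batch to assign to each device so that the
        slowest device finishes as early as possible. Hypothetical formulation."""
        n = len(per_sample_time)
        c = np.zeros(n + 1)                 # variables: x_0..x_{n-1} (ratios), T (makespan)
        c[-1] = 1.0                         # minimize T
        A_ub = np.hstack([np.diag(per_sample_time), -np.ones((n, 1))])  # t_d * x_d <= T
        b_ub = np.zeros(n)
        A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])           # ratios sum to 1
        b_eq = np.array([1.0])
        bounds = [(0, 1)] * n + [(0, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:n]

    # Example: a fast device (1 ms/sample) and a slower one (3 ms/sample) -> roughly [0.75, 0.25]
    print(sharding_ratios([1e-3, 3e-3]))

Feeding such ratios back into the generated program is where, per the abstract, HAP's A*-based program synthesis and its choice of communication methods come in; this sketch does not attempt to model those parts.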

 

Dublin Core Record
dc.contributor.author: Zhang, Shiwei
dc.contributor.author: 張實唯
dc.date.accessioned: 2025-10-20T02:56:18Z
dc.date.available: 2025-10-20T02:56:18Z
dc.date.issued: 2025
dc.identifier.citation: Zhang, S. [張實唯]. (2025). Automated optimization of distributed tensor programs for large DNN training and inference. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/363981
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Deep learning (Machine learning)
dc.subject.lcsh: Parallel processing (Electronic computers)
dc.subject.lcsh: Program transformation (Computer programming)
dc.title: Automated optimization of distributed tensor programs for large DNN training and inference
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2025
dc.identifier.mmsid: 991045117392003414
