
Postgraduate thesis: Automated optimization of distributed tensor programs for large DNN training and inference

Title: Automated optimization of distributed tensor programs for large DNN training and inference
Authors: Zhang, Shiwei (張實唯)
Issue Date: 2025
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Zhang, S. [張實唯]. (2025). Automated optimization of distributed tensor programs for large DNN training and inference. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Modern large-scale machine learning relies heavily on complex computations expressed as tensor programs, typically distributed across diverse hardware clusters. Optimizing these programs for performance presents significant challenges due to model scale, hardware heterogeneity, intricate parallelism strategies, and dynamic execution patterns. This thesis focuses on automated optimization of distributed tensor programs, developing systems and methodologies to automatically enhance the performance of both training and inference for large models. First, we introduce TAG for optimizing DNN training, viewing the process as deploying a tensor program onto heterogeneous devices. TAG employs a graph neural network (GNN) that processes both the computation graph (tensor operations) and the device topology, coupled with Monte-Carlo tree search, to automatically derive optimized distributed execution plans. It further automates communication optimization via sufficient factor broadcasting. TAG demonstrates up to 4.56x speed-up, showcasing effective automated optimization for heterogeneous training programs. Second, addressing performance bottlenecks in SPMD parallelism for trillion-parameter models, we present HiDup, which optimizes distributed tensor programs through computation-communication overlapping. We propose interleaving the execution of two microbatches, so that the computation of one microbatch overlaps the communication of the other. A dynamic programming algorithm automates the search for optimal tensor sharding strategies within this restructured program, maximizing overlap and achieving up to 61% training speed-up. Third, we develop HAP to further accelerate SPMD tensor programs on heterogeneous clusters. We formulate the partitioning of the tensor program as a program synthesis problem, automatically generating an optimized distributed program from a single-device program using A*-based search. HAP automatically co-optimizes the tensor sharding strategy, device-specific sharding ratios (via linear programming), and efficient tensor communication methods. HAP achieves up to 2.41x speed-up, demonstrating effective automated optimization for SPMD programs on heterogeneous hardware. Finally, we tackle the automated optimization of dynamic workflows composed of multiple tensor programs. Conditional execution in these workflows hinders static optimization. We introduce DyOrc, a workflow serving system that deploys components as independently scalable services. DyOrc features: (i) speculative scheduling for effective batching of tensor computations; (ii) multi-tier message passing for efficient inter-program tensor communication; and (iii) proactive loading to hide startup latency. DyOrc improves serving latency by 4–198% on diverse dynamic workloads.
Degree: Doctor of Philosophy
Subject: Machine learning; Deep learning (Machine learning); Parallel processing (Electronic computers); Program transformation (Computer programming)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/363981
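
The abstract describes HiDup's core idea as interleaving two microbatches so that the computation of one overlaps the communication of the other. The fragment below is a minimal, hypothetical sketch of that interleaving pattern, not code from the thesis: it assumes PyTorch with an initialized NCCL process group and stands in for "communication" with an asynchronous all-reduce of each layer's output. All names are illustrative.

    import torch
    import torch.distributed as dist

    def interleaved_forward(layers, micro_a, micro_b):
        """Push two microbatches through `layers`, issuing each collective
        asynchronously so that it overlaps the other microbatch's compute."""
        acts = [micro_a, micro_b]
        pending = None  # in-flight all-reduce belonging to the other microbatch
        for layer in layers:
            for i in (0, 1):
                out = layer(acts[i])       # compute the current microbatch
                if pending is not None:
                    pending.wait()         # the other collective overlapped this compute
                pending = dist.all_reduce(out, async_op=True)  # overlaps the next compute
                acts[i] = out
        if pending is not None:
            pending.wait()
        return acts

The two-microbatch restructuring guarantees there is always independent computation available to hide each collective behind; per the abstract, HiDup's dynamic program then picks the shardings that maximize how much communication can actually be hidden.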
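
That dynamic program is described in the abstract only as a search over tensor sharding strategies. As a rough, hypothetical illustration of how such a chain-structured search can work (with a made-up cost model, not the thesis's), one can minimize total compute plus resharding cost layer by layer:

    def best_sharding(compute_cost, reshard_cost):
        """compute_cost[l][s]: cost of running layer l under sharding s.
        reshard_cost[l][p][s]: cost of converting layer l-1's output from
        sharding p to the sharding s expected by layer l.
        Returns (total cost, chosen sharding per layer). Illustrative only."""
        L, S = len(compute_cost), len(compute_cost[0])
        INF = float("inf")
        dp = [[INF] * S for _ in range(L)]   # dp[l][s]: best cost of layers 0..l ending in sharding s
        choice = [[0] * S for _ in range(L)]
        dp[0] = list(compute_cost[0])
        for l in range(1, L):
            for s in range(S):
                for p in range(S):
                    cand = dp[l - 1][p] + reshard_cost[l][p][s] + compute_cost[l][s]
                    if cand < dp[l][s]:
                        dp[l][s], choice[l][s] = cand, p
        s = min(range(S), key=lambda k: dp[L - 1][k])   # best final sharding
        plan = [s]
        for l in range(L - 1, 0, -1):                   # backtrack the decisions
            s = choice[l][s]
            plan.append(s)
        plan.reverse()
        return dp[L - 1][plan[-1]], plan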
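
For HAP, the abstract states that device-specific sharding ratios on a heterogeneous cluster are chosen via linear programming. One plausible formulation, assumed here rather than taken from the thesis, balances per-device compute time by minimizing the makespan; with SciPy it can be written as:

    import numpy as np
    from scipy.optimize import linprog

    def sharding_ratios(per_sample_time):
        """per_sample_time[d]: seconds per training sample on device d (assumed known).
        Returns the fraction of each batch to assign to each device so that the
        slowest device finishes as early as possible. Hypothetical formulation."""
        n = len(per_sample_time)
        c = np.zeros(n + 1)                 # variables: x_0..x_{n-1} (ratios), T (makespan)
        c[-1] = 1.0                         # minimize T
        A_ub = np.hstack([np.diag(per_sample_time), -np.ones((n, 1))])  # t_d * x_d <= T
        b_ub = np.zeros(n)
        A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])           # ratios sum to 1
        b_eq = np.array([1.0])
        bounds = [(0, 1)] * n + [(0, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:n]

    # Example: a fast device (1 ms/sample) and a slower one (3 ms/sample) -> roughly [0.75, 0.25]
    print(sharding_ratios([1e-3, 3e-3]))

Feeding such ratios back into the generated program is where, per the abstract, HAP's A*-based program synthesis and its choice of communication methods come in; this sketch does not attempt to model those parts.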

 

Dublin Core Record
dc.contributor.author: Zhang, Shiwei
dc.contributor.author: 張實唯
dc.date.accessioned: 2025-10-20T02:56:18Z
dc.date.available: 2025-10-20T02:56:18Z
dc.date.issued: 2025
dc.identifier.citation: Zhang, S. [張實唯]. (2025). Automated optimization of distributed tensor programs for large DNN training and inference. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/363981
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Deep learning (Machine learning)
dc.subject.lcsh: Parallel processing (Electronic computers)
dc.subject.lcsh: Program transformation (Computer programming)
dc.title: Automated optimization of distributed tensor programs for large DNN training and inference
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2025
dc.identifier.mmsid: 991045117392003414
