postgraduate thesis: Automated optimization of distributed tensor programs for large DNN training and inference
| Field | Value |
|---|---|
| Title | Automated optimization of distributed tensor programs for large DNN training and inference |
| Authors | Zhang, Shiwei (張實唯) |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Zhang, S. [張實唯]. (2025). Automated optimization of distributed tensor programs for large DNN training and inference. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Modern large-scale machine learning relies heavily on complex computations expressed as tensor programs, typically distributed across diverse hardware clusters. Optimizing these programs for performance presents significant challenges due to model scale, hardware heterogeneity, intricate parallelism strategies, and dynamic execution patterns. This thesis focuses on automated optimization of distributed tensor programs, developing systems and methodologies to automatically enhance the performance of both training and inference for large models.
First, we introduce TAG for optimizing DNN training, viewing the process as deploying a tensor program onto heterogeneous devices. TAG employs a graph neural network (GNN) processing both the computation graph (tensor operations) and device topology, coupled with Monte-Carlo tree search, to automatically derive optimized distributed execution plans. It further automates communication optimization via sufficient factor broadcasting. TAG demonstrates up to 4.56x speed-up, showcasing effective automated optimization for heterogeneous training programs.
Second, addressing performance bottlenecks in SPMD parallelism for trillion-parameter models, we present HiDup, which optimizes distributed tensor programs via computation-communication overlapping. We propose interleaving the execution of two microbatches, so that the computation of one microbatch overlaps the communication of the other. A dynamic programming algorithm automates the search for optimal tensor sharding strategies within this restructured program, maximizing overlap and achieving up to 61% training speed-up.
Third, we develop HAP to further accelerate SPMD tensor programs on heterogeneous clusters. We formulate the partitioning of the tensor program as a program synthesis problem, automatically generating an optimized distributed program from a single-device program using A*-based search. HAP automatically co-optimizes the tensor sharding strategy, device-specific sharding ratios (via linear programming), and efficient tensor communication methods. HAP achieves up to 2.41x speed-up, demonstrating effective automated optimization for SPMD programs on heterogeneous hardware.
Finally, we tackle the automated optimization of dynamic workflows composed of multiple tensor programs. Conditional execution in these workflows hinders static optimization. We introduce DyOrc, a workflow serving system that deploys components as independently scalable services. DyOrc features: (i) speculative scheduling for effective batching of tensor computations; (ii) multi-tier message passing for efficient inter-program tensor communication; and (iii) proactive loading to hide startup latency. DyOrc improves serving latency by 4-198% on diverse dynamic workloads. |
| Degree | Doctor of Philosophy |
| Subject | Machine learning; Deep learning (Machine learning); Parallel processing (Electronic computers); Program transformation (Computer programming) |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/363981 |
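
The abstract above credits HiDup with interleaving two microbatches so that one microbatch's computation overlaps the other's communication. The toy timeline simulation below is a minimal sketch of that scheduling idea only, not HiDup's implementation; the per-layer compute and communication costs are made-up numbers, and the two-stream execution model is an assumption.

```python
# Toy timeline model: one "compute stream" and one "communication stream",
# two microbatches A and B per training step. Costs are hypothetical (ms).
compute = [4.0, 6.0, 5.0, 3.0]   # per-layer compute time
comm    = [3.0, 2.0, 4.0, 2.0]   # per-layer collective-communication time

def no_overlap(compute, comm):
    """Two microbatches run back to back; each layer computes, then communicates."""
    return 2 * sum(c + m for c, m in zip(compute, comm))

def interleaved(compute, comm):
    """While microbatch A's layer-i collective is in flight on the comm stream,
    microbatch B computes layer i on the otherwise idle compute stream."""
    t_comp = t_comm = 0.0        # time each stream next becomes free
    ready_a = ready_b = 0.0      # time each microbatch's previous stage finished
    for c, m in zip(compute, comm):
        ready_a = max(t_comp, ready_a) + c; t_comp = ready_a   # A computes layer i
        ready_a = max(t_comm, ready_a) + m; t_comm = ready_a   # A communicates ...
        ready_b = max(t_comp, ready_b) + c; t_comp = ready_b   # ... while B computes
        ready_b = max(t_comm, ready_b) + m; t_comm = ready_b   # B communicates layer i
    return max(ready_a, ready_b)

print(f"no overlap:  {no_overlap(compute, comm):.1f} ms per step")
print(f"interleaved: {interleaved(compute, comm):.1f} ms per step")
```

With these invented costs the interleaved schedule hides most of the communication time behind computation, which is the effect behind the speed-up the abstract reports.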
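
The same paragraph states that a dynamic programming algorithm searches for tensor sharding strategies within the restructured program. The toy dynamic program below illustrates the general flavor of such a per-layer search with hypothetical sharding choices, compute costs, and resharding costs; it is not HiDup's actual algorithm or cost model.

```python
# Toy per-layer sharding search: pick one sharding per layer so that the sum of
# (per-layer cost under that sharding) + (resharding cost between consecutive
# layers) is minimized. All names and numbers are invented for illustration.
layer_cost = [
    {"row": 5.0, "col": 6.0, "replicate": 9.0},
    {"row": 7.0, "col": 4.0, "replicate": 9.0},
    {"row": 3.0, "col": 5.0, "replicate": 8.0},
]

def reshard(a, b):
    """Hypothetical cost of converting a layer output from sharding a to sharding b."""
    return 0.0 if a == b else 2.0

def best_plan(layer_cost):
    # dp[s] = (cost of the cheapest plan over the layers seen so far ending in s, plan)
    dp = {s: (c, [s]) for s, c in layer_cost[0].items()}
    for costs in layer_cost[1:]:
        new_dp = {}
        for s, c in costs.items():
            prev_s, (prev_cost, plan) = min(
                dp.items(), key=lambda kv: kv[1][0] + reshard(kv[0], s))
            new_dp[s] = (prev_cost + reshard(prev_s, s) + c, plan + [s])
        dp = new_dp
    return min(dp.values())

total, plan = best_plan(layer_cost)
print("cheapest sharding per layer:", plan, "| total cost:", total)
```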
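
For HAP, the abstract mentions solving for device-specific sharding ratios on heterogeneous clusters via linear programming. The sketch below is one plausible way to pose that sub-problem, assuming only hypothetical per-device relative speeds and minimizing the slowest device's per-step time; it is illustrative and not necessarily HAP's exact formulation. It requires NumPy and SciPy.

```python
# Give each device a fraction w_i of the sharded work so that the slowest device
# finishes as early as possible:
#   minimize T  subject to  w_i / speed_i <= T,  sum_i w_i = 1,  w_i >= 0.
import numpy as np
from scipy.optimize import linprog

speeds = np.array([1.0, 1.0, 2.5, 4.0])   # hypothetical relative throughputs
n = len(speeds)

# Decision variables x = [w_0, ..., w_{n-1}, T]; the objective is to minimize T.
c = np.zeros(n + 1)
c[-1] = 1.0

# w_i / speed_i - T <= 0 for every device i.
A_ub = np.zeros((n, n + 1))
A_ub[np.arange(n), np.arange(n)] = 1.0 / speeds
A_ub[:, -1] = -1.0
b_ub = np.zeros(n)

# The shards must cover the whole tensor: sum_i w_i = 1 (T excluded from the sum).
A_eq = np.ones((1, n + 1))
A_eq[0, -1] = 0.0
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (n + 1))
print("sharding ratios:", np.round(res.x[:n], 3))   # proportional to speed here
print("normalized step time:", round(res.x[-1], 3))
```

With a single linear resource this LP simply recovers ratios proportional to device speed, but the same template accommodates additional constraints (for example, per-device memory caps).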
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Zhang, Shiwei | - |
| dc.contributor.author | 張實唯 | - |
| dc.date.accessioned | 2025-10-20T02:56:18Z | - |
| dc.date.available | 2025-10-20T02:56:18Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Zhang, S. [張實唯]. (2025). Automated optimization of distributed tensor programs for large DNN training and inference. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/363981 | - |
| dc.description.abstract | Modern large-scale machine learning relies heavily on complex computations expressed as tensor programs, typically distributed across diverse hardware clusters. Optimizing these programs for performance presents significant challenges due to model scale, hardware heterogeneity, intricate parallelism strategies, and dynamic execution patterns. This thesis focuses on automated optimization of distributed tensor programs, developing systems and methodologies to automatically enhance the performance of both training and inference for large models. First, we introduce TAG for optimizing DNN training, viewing the process as deploying a tensor program onto heterogeneous devices. TAG employs a graph neural network (GNN) processing both the computation graph (tensor operations) and device topology, coupled with Monte-Carlo tree search, to automatically derive optimized distributed execution plans. It further automates communication optimization via sufficient factor broadcasting. TAG demonstrates up to 4.56x speed-up, showcasing effective automated optimization for heterogeneous training programs. Second, addressing performance bottlenecks in SPMD parallelism for trillion-parameter models, we present HiDup, which optimizes distributed tensor programs via computation-communication overlapping. We propose interleaving the execution of two microbatches, so that the computation of one microbatch overlaps the communication of the other. A dynamic programming algorithm automates the search for optimal tensor sharding strategies within this restructured program, maximizing overlap and achieving up to 61% training speed-up. Third, we develop HAP to further accelerate SPMD tensor programs on heterogeneous clusters. We formulate the partitioning of the tensor program as a program synthesis problem, automatically generating an optimized distributed program from a single-device program using A*-based search. HAP automatically co-optimizes the tensor sharding strategy, device-specific sharding ratios (via linear programming), and efficient tensor communication methods. HAP achieves up to 2.41x speed-up, demonstrating effective automated optimization for SPMD programs on heterogeneous hardware. Finally, we tackle the automated optimization of dynamic workflows composed of multiple tensor programs. Conditional execution in these workflows hinders static optimization. We introduce DyOrc, a workflow serving system that deploys components as independently scalable services. DyOrc features: (i) speculative scheduling for effective batching of tensor computations; (ii) multi-tier message passing for efficient inter-program tensor communication; and (iii) proactive loading to hide startup latency. DyOrc improves serving latency by 4-198% on diverse dynamic workloads. | en |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Machine learning | - |
| dc.subject.lcsh | Deep learning (Machine learning) | - |
| dc.subject.lcsh | Parallel processing (Electronic computers) | - |
| dc.subject.lcsh | Program transformation (Computer programming) | - |
| dc.title | Automated optimization of distributed tensor programs for large DNN training and inference | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045117392003414 | - |
