Appears in Collections: postgraduate thesis: GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models
| Title | GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models |
|---|---|
| Authors | Zhao, Pengxiang (赵鹏翔) |
| Advisors | Yuan, X; Zang, W |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Zhao, P. [赵鹏翔]. (2025). GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Large-scale optimization is increasingly critical across domains such as cloud computing and large language models (LLMs), where problems often involve millions or even billions of variables and constraints. However, traditional optimization methods often struggle to address the immense computational complexity and stringent time constraints inherent in these real-world applications. This thesis develops optimization algorithms that are adaptive to modern Graphics Processing Units (GPUs), leveraging their parallel processing capabilities to improve scalability, efficiency, and practical applicability. Four key contributions are presented in this thesis. The first contribution focuses on cost-effective traffic allocation in cloud computing under the industry-standard 95th percentile billing scheme, a mixed-integer programming problem involving millions of decision variables. To address this challenge, we propose a two-stage GPU-adaptive framework, where the Circling Reduction Algorithm efficiently determines binary variables, followed by a GPU-adaptive Alternating Direction Method of Multipliers for the remaining large-scale linear programming subproblem. Validation on real-world cloud traffic data demonstrates that this method reduces bandwidth costs by up to 66.34% while achieving substantial speed improvements over traditional solvers. The second contribution addresses the memory-intensive nature of LLM pretraining by introducing Adapprox, a novel memory-efficient optimizer that reduces the memory footprint of optimizer states using GPU-adaptive randomized low-rank approximation of second-moment matrices. Additionally, Adapprox employs a dynamic rank adjustment mechanism to balance the memory efficiency and convergence performance. Compared to AdamW, it reduces memory overhead while maintaining or even improving convergence across models such as GPT-2 and BERT with millions of parameters. The third contribution enhances the efficiency of LLM inference with GANQ, a non-uniform post-training quantization framework. GANQ formulates quantization as a mixed-integer quadratic programming problem and employs GPU-adaptive decomposition strategies along with an alternating direction optimization framework to efficiently solve this large-scale problem. Experimental results show that GANQ significantly outperforms existing quantization methods in accuracy retention while accelerating inference by up to 2.57×. Finally, we present FISTAPruner, a layer-wise post-training pruning framework for LLMs that formulates pruning as an optimization problem and solves it using the GPU-adaptive Fast Iterative Shrinkage-Thresholding Algorithm. It supports both unstructured pruning and 2:4 semi-structured sparsity for enhanced efficiency and hardware compatibility. When applied to LLaMA-3-70B with 50% parameter pruning, FISTAPruner preserves up to 98.6% of the original zero-shot performance for unstructured sparsity and 95.6% for 2:4 semi-structured sparsity, outperforming existing pruning methods. Collectively, these contributions illustrate the effectiveness of GPU-adaptive optimization algorithms in tackling large-scale optimization challenges in real-world applications, while highlighting the importance of integrating advanced optimization techniques with hardware-aware algorithmic design. |
| Degree | Doctor of Philosophy |
| Subject | Graphics processing units; Mathematical optimization; Cloud computing; Natural language processing (Computer science) |
| Dept/Program | Mathematics |
| Persistent Identifier | http://hdl.handle.net/10722/367409 |
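
For readers unfamiliar with the 95th percentile billing scheme referenced in the abstract, the following sketch shows how the billable bandwidth is commonly computed: traffic is sampled over the billing cycle, the top 5% of samples are discarded, and the largest remaining sample is charged. This is a generic NumPy illustration; the function name and the synthetic traffic data are assumptions for demonstration and are not taken from the thesis.

```python
import numpy as np

def percentile95_billable(samples_mbps: np.ndarray) -> float:
    """Return the 95th percentile billable bandwidth for one billing cycle.

    Traffic is typically sampled every 5 minutes; the top 5% of samples
    are discarded and the largest remaining sample determines the charge.
    """
    sorted_samples = np.sort(samples_mbps)
    cutoff = int(np.ceil(0.95 * sorted_samples.size)) - 1  # index of the 95th-percentile sample
    return float(sorted_samples[cutoff])

# Illustrative example: 8,640 five-minute samples in a 30-day month (synthetic data).
rng = np.random.default_rng(0)
month = rng.gamma(shape=2.0, scale=50.0, size=8640)  # synthetic traffic in Mbps
print(f"Billable bandwidth: {percentile95_billable(month):.1f} Mbps")
```

Because the cost depends only on which samples are pushed into the discarded top 5%, allocating traffic across links involves discrete peak-or-not decisions, which is consistent with the abstract's framing of the problem as a mixed-integer program with millions of variables.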
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Yuan, X | - |
| dc.contributor.advisor | Zang, W | - |
| dc.contributor.author | Zhao, Pengxiang | - |
| dc.contributor.author | 赵鹏翔 | - |
| dc.date.accessioned | 2025-12-11T06:41:46Z | - |
| dc.date.available | 2025-12-11T06:41:46Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Zhao, P. [赵鹏翔]. (2025). GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/367409 | - |
| dc.description.abstract | Large-scale optimization is increasingly critical across domains such as cloud computing and large language models (LLMs), where problems often involve millions or even billions of variables and constraints. However, traditional optimization methods often struggle to address the immense computational complexity and stringent time constraints inherent in these real-world applications. This thesis develops optimization algorithms that are adaptive to modern Graphics Processing Units (GPUs), leveraging their parallel processing capabilities to improve scalability, efficiency, and practical applicability. Four key contributions are presented in this thesis. The first contribution focuses on cost-effective traffic allocation in cloud computing under the industry-standard 95th percentile billing scheme, a mixed-integer programming problem involving millions of decision variables. To address this challenge, we propose a two-stage GPU-adaptive framework, where the Circling Reduction Algorithm efficiently determines binary variables, followed by a GPU-adaptive Alternating Direction Method of Multipliers for the remaining large-scale linear programming subproblem. Validation on real-world cloud traffic data demonstrates that this method reduces bandwidth costs by up to 66.34% while achieving substantial speed improvements over traditional solvers. The second contribution addresses the memory-intensive nature of LLM pretraining by introducing Adapprox, a novel memory-efficient optimizer that reduces the memory footprint of optimizer states using GPU-adaptive randomized low-rank approximation of second-moment matrices. Additionally, Adapprox employs a dynamic rank adjustment mechanism to balance the memory efficiency and convergence performance. Compared to AdamW, it reduces memory overhead while maintaining or even improving convergence across models such as GPT-2 and BERT with millions of parameters. The third contribution enhances the efficiency of LLM inference with GANQ, a non-uniform post-training quantization framework. GANQ formulates quantization as a mixed-integer quadratic programming problem and employs GPU-adaptive decomposition strategies along with an alternating direction optimization framework to efficiently solve this large-scale problem. Experimental results show that GANQ significantly outperforms existing quantization methods in accuracy retention while accelerating inference by up to 2.57×. Finally, we present FISTAPruner, a layer-wise post-training pruning framework for LLMs that formulates pruning as an optimization problem and solves it using the GPU-adaptive Fast Iterative Shrinkage-Thresholding Algorithm. It supports both unstructured pruning and 2:4 semi-structured sparsity for enhanced efficiency and hardware compatibility. When applied to LLaMA-3-70B with 50% parameter pruning, FISTAPruner preserves up to 98.6% of the original zero-shot performance for unstructured sparsity and 95.6% for 2:4 semi-structured sparsity, outperforming existing pruning methods. Collectively, these contributions illustrate the effectiveness of GPU-adaptive optimization algorithms in tackling large-scale optimization challenges in real-world applications, while highlighting the importance of integrating advanced optimization techniques with hardware-aware algorithmic design. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Graphics processing units | - |
| dc.subject.lcsh | Mathematical optimization | - |
| dc.subject.lcsh | Cloud computing | - |
| dc.subject.lcsh | Natural language processing (Computer science) | - |
| dc.title | GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Mathematics | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045147149003414 | - |
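
As a rough illustration of the shrinkage-thresholding mechanics behind the FISTAPruner contribution summarized in the abstract, the sketch below applies textbook FISTA to an ℓ1-regularized least-squares reconstruction of a single layer's weights. This is a minimal NumPy sketch under assumed notation (X for calibration activations, W0 for the dense weights, lam for the sparsity penalty); it is not the thesis's GPU-adaptive implementation and omits the 2:4 semi-structured constraints.

```python
import numpy as np

def fista_prune(X: np.ndarray, W0: np.ndarray, lam: float = 1e-2, iters: int = 200) -> np.ndarray:
    """Textbook FISTA for  min_W ||X @ W - X @ W0||_F^2 + lam * ||W||_1.

    X  : calibration activations (n_samples x d_in)
    W0 : dense pretrained layer weights (d_in x d_out)
    Returns a sparsified weight matrix W.
    """
    L = 2.0 * np.linalg.norm(X.T @ X, 2)        # Lipschitz constant of the smooth term
    target = X @ W0
    W = Y = W0.copy()
    t = 1.0
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ Y - target)      # gradient of the quadratic term at Y
        Z = Y - grad / L                         # gradient step
        W_next = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)  # soft-thresholding (prox of lam*||.||_1)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = W_next + ((t - 1.0) / t_next) * (W_next - W)            # Nesterov-style momentum
        W, t = W_next, t_next
    return W
```

The ℓ1 proximal step drives many entries of W exactly to zero; a layer-wise pruner along these lines would run this reconstruction independently per layer (and, per the abstract, on the GPU), then keep only the surviving weights.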
