Appears in Collections: postgraduate thesis: GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models
| Title | GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models |
|---|---|
| Authors | Zhao, Pengxiang (赵鹏翔) |
| Advisors | Yuan, X; Zang, W |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Zhao, P. [赵鹏翔]. (2025). GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Large-scale optimization is increasingly critical across domains such as cloud computing and large language models (LLMs), where problems often involve millions or even billions of variables and constraints. However, traditional optimization methods often struggle to address the immense computational complexity and stringent time constraints inherent in these real-world applications. This thesis develops optimization algorithms that are adaptive to modern Graphics Processing Units (GPUs), leveraging their parallel processing capabilities to improve scalability, efficiency, and practical applicability. Four key contributions are presented in this thesis. The first contribution focuses on cost-effective traffic allocation in cloud computing under the industry-standard 95th percentile billing scheme, a mixed-integer programming problem involving millions of decision variables. To address this challenge, we propose a two-stage GPU-adaptive framework, where the Circling Reduction Algorithm efficiently determines binary variables, followed by a GPU-adaptive Alternating Direction Method of Multipliers for the remaining large-scale linear programming subproblem. Validation on real-world cloud traffic data demonstrates that this method reduces bandwidth costs by up to 66.34% while achieving substantial speed improvements over traditional solvers. The second contribution addresses the memory-intensive nature of LLM pretraining by introducing Adapprox, a novel memory-efficient optimizer that reduces the memory footprint of optimizer states using GPU-adaptive randomized low-rank approximation of second-moment matrices. Additionally, Adapprox employs a dynamic rank adjustment mechanism to balance the memory efficiency and convergence performance. Compared to AdamW, it reduces memory overhead while maintaining or even improving convergence across models such as GPT-2 and BERT with millions of parameters. The third contribution enhances the efficiency of LLM inference with GANQ, a non-uniform post-training quantization framework. GANQ formulates quantization as a mixed-integer quadratic programming problem and employs GPU-adaptive decomposition strategies along with an alternating direction optimization framework to efficiently solve this large-scale problem. Experimental results show that GANQ significantly outperforms existing quantization methods in accuracy retention while accelerating inference by up to 2.57×. Finally, we present FISTAPruner, a layer-wise post-training pruning framework for LLMs that formulates pruning as an optimization problem and solves it using the GPU-adaptive Fast Iterative Shrinkage-Thresholding Algorithm. It supports both unstructured pruning and 2:4 semi-structured sparsity for enhanced efficiency and hardware compatibility. When applied to LLaMA-3-70B with 50% parameter pruning, FISTAPruner preserves up to 98.6% of the original zero-shot performance for unstructured sparsity and 95.6% for 2:4 semi-structured sparsity, outperforming existing pruning methods. Collectively, these contributions illustrate the effectiveness of GPU-adaptive optimization algorithms in tackling large-scale optimization challenges in real-world applications, while highlighting the importance of integrating advanced optimization techniques with hardware-aware algorithmic design. |
| Degree | Doctor of Philosophy |
| Subject | Graphics processing units; Mathematical optimization; Cloud computing; Natural language processing (Computer science) |
| Dept/Program | Mathematics |
| Persistent Identifier | http://hdl.handle.net/10722/367409 |
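
For readers unfamiliar with the 95th percentile billing scheme referenced in the abstract, the following sketch shows how the billable bandwidth is commonly computed: traffic is sampled over the billing cycle, the top 5% of samples are discarded, and the largest remaining sample is charged. This is a generic NumPy illustration; the function name and the synthetic traffic data are assumptions for demonstration and are not taken from the thesis.

```python
import numpy as np

def percentile95_billable(samples_mbps: np.ndarray) -> float:
    """Return the 95th percentile billable bandwidth for one billing cycle.

    Traffic is typically sampled every 5 minutes; the top 5% of samples
    are discarded and the largest remaining sample determines the charge.
    """
    sorted_samples = np.sort(samples_mbps)
    cutoff = int(np.ceil(0.95 * sorted_samples.size)) - 1  # index of the 95th-percentile sample
    return float(sorted_samples[cutoff])

# Illustrative example: 8,640 five-minute samples in a 30-day month (synthetic data).
rng = np.random.default_rng(0)
month = rng.gamma(shape=2.0, scale=50.0, size=8640)  # synthetic traffic in Mbps
print(f"Billable bandwidth: {percentile95_billable(month):.1f} Mbps")
```

Because the cost depends only on which samples are pushed into the discarded top 5%, allocating traffic across links involves discrete peak-or-not decisions, which is consistent with the abstract's framing of the problem as a mixed-integer program with millions of variables.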
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Yuan, X | - |
| dc.contributor.advisor | Zang, W | - |
| dc.contributor.author | Zhao, Pengxiang | - |
| dc.contributor.author | 赵鹏翔 | - |
| dc.date.accessioned | 2025-12-11T06:41:46Z | - |
| dc.date.available | 2025-12-11T06:41:46Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Zhao, P. [赵鹏翔]. (2025). GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/367409 | - |
| dc.description.abstract | Large-scale optimization is increasingly critical across domains such as cloud computing and large language models (LLMs), where problems often involve millions or even billions of variables and constraints. However, traditional optimization methods often struggle to address the immense computational complexity and stringent time constraints inherent in these real-world applications. This thesis develops optimization algorithms that are adaptive to modern Graphics Processing Units (GPUs), leveraging their parallel processing capabilities to improve scalability, efficiency, and practical applicability. Four key contributions are presented in this thesis. The first contribution focuses on cost-effective traffic allocation in cloud computing under the industry-standard 95th percentile billing scheme, a mixed-integer programming problem involving millions of decision variables. To address this challenge, we propose a two-stage GPU-adaptive framework, where the Circling Reduction Algorithm efficiently determines binary variables, followed by a GPU-adaptive Alternating Direction Method of Multipliers for the remaining large-scale linear programming subproblem. Validation on real-world cloud traffic data demonstrates that this method reduces bandwidth costs by up to 66.34% while achieving substantial speed improvements over traditional solvers. The second contribution addresses the memory-intensive nature of LLM pretraining by introducing Adapprox, a novel memory-efficient optimizer that reduces the memory footprint of optimizer states using GPU-adaptive randomized low-rank approximation of second-moment matrices. Additionally, Adapprox employs a dynamic rank adjustment mechanism to balance the memory efficiency and convergence performance. Compared to AdamW, it reduces memory overhead while maintaining or even improving convergence across models such as GPT-2 and BERT with millions of parameters. The third contribution enhances the efficiency of LLM inference with GANQ, a non-uniform post-training quantization framework. GANQ formulates quantization as a mixed-integer quadratic programming problem and employs GPU-adaptive decomposition strategies along with an alternating direction optimization framework to efficiently solve this large-scale problem. Experimental results show that GANQ significantly outperforms existing quantization methods in accuracy retention while accelerating inference by up to 2.57×. Finally, we present FISTAPruner, a layer-wise post-training pruning framework for LLMs that formulates pruning as an optimization problem and solves it using the GPU-adaptive Fast Iterative Shrinkage-Thresholding Algorithm. It supports both unstructured pruning and 2:4 semi-structured sparsity for enhanced efficiency and hardware compatibility. When applied to LLaMA-3-70B with 50% parameter pruning, FISTAPruner preserves up to 98.6% of the original zero-shot performance for unstructured sparsity and 95.6% for 2:4 semi-structured sparsity, outperforming existing pruning methods. Collectively, these contributions illustrate the effectiveness of GPU-adaptive optimization algorithms in tackling large-scale optimization challenges in real-world applications, while highlighting the importance of integrating advanced optimization techniques with hardware-aware algorithmic design. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Graphics processing units | - |
| dc.subject.lcsh | Mathematical optimization | - |
| dc.subject.lcsh | Cloud computing | - |
| dc.subject.lcsh | Natural language processing (Computer science) | - |
| dc.title | GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Mathematics | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045147149003414 | - |
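
As a rough illustration of the shrinkage-thresholding mechanics behind the FISTAPruner contribution summarized in the abstract, the sketch below applies textbook FISTA to an ℓ1-regularized least-squares reconstruction of a single layer's weights. This is a minimal NumPy sketch under assumed notation (X for calibration activations, W0 for the dense weights, lam for the sparsity penalty); it is not the thesis's GPU-adaptive implementation and omits the 2:4 semi-structured constraints.

```python
import numpy as np

def fista_prune(X: np.ndarray, W0: np.ndarray, lam: float = 1e-2, iters: int = 200) -> np.ndarray:
    """Textbook FISTA for  min_W ||X @ W - X @ W0||_F^2 + lam * ||W||_1.

    X  : calibration activations (n_samples x d_in)
    W0 : dense pretrained layer weights (d_in x d_out)
    Returns a sparsified weight matrix W.
    """
    L = 2.0 * np.linalg.norm(X.T @ X, 2)        # Lipschitz constant of the smooth term
    target = X @ W0
    W = Y = W0.copy()
    t = 1.0
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ Y - target)      # gradient of the quadratic term at Y
        Z = Y - grad / L                         # gradient step
        W_next = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)  # soft-thresholding (prox of lam*||.||_1)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = W_next + ((t - 1.0) / t_next) * (W_next - W)            # Nesterov-style momentum
        W, t = W_next, t_next
    return W
```

The ℓ1 proximal step drives many entries of W exactly to zero; a layer-wise pruner along these lines would run this reconstruction independently per layer (and, per the abstract, on the GPU), then keep only the surviving weights.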
