
Postgraduate Thesis: GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models

Title: GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models
Authors: Zhao, Pengxiang (赵鹏翔)
Advisor(s): Yuan, X; Zang, W
Issue Date: 2025
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Zhao, P. [赵鹏翔]. (2025). GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract:
Large-scale optimization is increasingly critical across domains such as cloud computing and large language models (LLMs), where problems often involve millions or even billions of variables and constraints. However, traditional optimization methods often struggle to address the immense computational complexity and stringent time constraints inherent in these real-world applications. This thesis develops optimization algorithms that are adaptive to modern Graphics Processing Units (GPUs), leveraging their parallel processing capabilities to improve scalability, efficiency, and practical applicability. Four key contributions are presented.

The first contribution focuses on cost-effective traffic allocation in cloud computing under the industry-standard 95th percentile billing scheme, a mixed-integer programming problem involving millions of decision variables. To address this challenge, we propose a two-stage GPU-adaptive framework, in which the Circling Reduction Algorithm efficiently determines the binary variables and a GPU-adaptive Alternating Direction Method of Multipliers then solves the remaining large-scale linear programming subproblem. Validation on real-world cloud traffic data demonstrates that this method reduces bandwidth costs by up to 66.34% while achieving substantial speed improvements over traditional solvers.

The second contribution addresses the memory-intensive nature of LLM pretraining by introducing Adapprox, a novel memory-efficient optimizer that reduces the memory footprint of optimizer states using GPU-adaptive randomized low-rank approximation of second-moment matrices. Additionally, Adapprox employs a dynamic rank adjustment mechanism to balance memory efficiency and convergence performance. Compared to AdamW, it reduces memory overhead while maintaining or even improving convergence across models such as GPT-2 and BERT with millions of parameters.

The third contribution enhances the efficiency of LLM inference with GANQ, a non-uniform post-training quantization framework. GANQ formulates quantization as a mixed-integer quadratic programming problem and employs GPU-adaptive decomposition strategies along with an alternating direction optimization framework to efficiently solve this large-scale problem. Experimental results show that GANQ significantly outperforms existing quantization methods in accuracy retention while accelerating inference by up to 2.57×.

Finally, we present FISTAPruner, a layer-wise post-training pruning framework for LLMs that formulates pruning as an optimization problem and solves it using the GPU-adaptive Fast Iterative Shrinkage-Thresholding Algorithm. It supports both unstructured pruning and 2:4 semi-structured sparsity for enhanced efficiency and hardware compatibility. When applied to LLaMA-3-70B with 50% parameter pruning, FISTAPruner preserves up to 98.6% of the original zero-shot performance for unstructured sparsity and 95.6% for 2:4 semi-structured sparsity, outperforming existing pruning methods.

Collectively, these contributions illustrate the effectiveness of GPU-adaptive optimization algorithms in tackling large-scale optimization challenges in real-world applications, while highlighting the importance of integrating advanced optimization techniques with hardware-aware algorithmic design.
Degree: Doctor of Philosophy
Subject: Graphics processing units
Mathematical optimization
Cloud computing
Natural language processing (Computer science)
Dept/Program: Mathematics
Persistent Identifier: http://hdl.handle.net/10722/367409
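
The abstract's first contribution targets the 95th percentile billing scheme, under which each link is billed at the 95th percentile of its periodic traffic samples, so the top 5% of samples are effectively free; choosing which samples to push into that free band is what makes cost-optimal allocation combinatorial. Below is a minimal Python sketch of the billing rule itself, not the thesis's two-stage framework; the 5-minute sampling interval, function names, and synthetic data are illustrative assumptions.

    # Hedged sketch of 95th percentile billing (illustrative only; the
    # 5-minute interval and the gamma-distributed traffic are assumptions).
    import numpy as np

    def p95_cost(samples, unit_price):
        # Bill the link at the 95th percentile of its traffic samples;
        # the top 5% of samples do not affect the bill, which is why
        # allocating bursts across links can cut costs dramatically.
        return unit_price * float(np.percentile(samples, 95))

    rng = np.random.default_rng(0)
    # Roughly one month of 5-minute samples (~8640 points) on two links.
    links = [rng.gamma(2.0, 50.0, 8640) for _ in range(2)]
    total_cost = sum(p95_cost(s, unit_price=1.0) for s in links)
    print(f"monthly bill: {total_cost:.2f}")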
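The abstract describes Adapprox as compressing optimizer states via GPU-adaptive randomized low-rank approximation of second-moment matrices. The sketch below shows one standard randomized low-rank factorization (a Gaussian range finder followed by QR), chosen here only to illustrate the memory saving; the rank, oversampling parameter, and PyTorch API choices are assumptions, and the thesis's actual approximation and dynamic rank adjustment may differ.

    # Hedged sketch: randomized low-rank factorization of a second-moment
    # matrix V, storing (m + n) * r numbers instead of m * n.
    import torch

    def randomized_low_rank(V, r, oversample=5):
        m, n = V.shape
        # Gaussian test matrix; V @ Omega captures V's dominant range.
        Omega = torch.randn(n, r + oversample, device=V.device)
        Q, _ = torch.linalg.qr(V @ Omega)   # orthonormal basis, m x (r+p)
        B = Q[:, :r]                        # keep m x r factor ...
        C = B.T @ V                         # ... and r x n factor
        return B, C                         # V is approximated by B @ C

    V = torch.rand(1024, 1024) ** 2         # toy nonnegative second moments
    B, C = randomized_low_rank(V, r=8)
    rel_err = torch.linalg.norm(V - B @ C) / torch.linalg.norm(V)
    print(f"relative error at rank 8: {rel_err:.3f}")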
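GANQ, per the abstract, casts non-uniform quantization as a mixed-integer quadratic program solved with an alternating direction framework. The sketch below shows only that alternating structure on a deliberately simplified least-squares objective: a discrete step assigning each weight to its nearest codebook entry, then a continuous step refitting the codebook. The plain objective, 4-bit width, and quantile initialization are assumptions; GANQ's actual formulation accounts for layer activations rather than raw weight error.

    # Hedged sketch of alternating-direction non-uniform quantization
    # (simplified; not GANQ's actual activation-weighted objective).
    import torch

    def nonuniform_quantize(w, bits=4, iters=10):
        K = 2 ** bits
        # Initialize the codebook from uniform quantiles of the weights.
        codebook = torch.quantile(w, torch.linspace(0, 1, K))
        for _ in range(iters):
            # Discrete step: assign each weight to its nearest code.
            idx = torch.argmin((w[:, None] - codebook[None, :]).abs(), dim=1)
            # Continuous step: each code becomes the mean of its weights.
            for k in range(K):
                mask = idx == k
                if mask.any():
                    codebook[k] = w[mask].mean()
        return codebook[idx], idx, codebook

    w = torch.randn(4096)
    w_q, idx, codebook = nonuniform_quantize(w)
    print(f"quantization MSE: {((w - w_q) ** 2).mean():.5f}")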
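FISTAPruner, as summarized in the abstract, solves a layer-wise reconstruction problem with the Fast Iterative Shrinkage-Thresholding Algorithm, whose proximal step soft-thresholds weights to exact zeros. Below is a generic FISTA-for-L1 sketch on a layer reconstruction objective; the step size, penalty weight, and plain L1 formulation are assumptions, and the thesis's 2:4 semi-structured sparsity would require an additional projection not shown here.

    # Hedged sketch: FISTA on ||X W_hat - X W||_F^2 + lam * ||W_hat||_1,
    # where X holds calibration activations (a generic setup, not the
    # thesis's exact formulation).
    import torch

    def soft_threshold(x, t):
        # Proximal operator of the L1 norm: shrinks toward exact zeros.
        return torch.sign(x) * torch.clamp(x.abs() - t, min=0)

    def fista_prune(X, W, lam=0.1, iters=100):
        L = torch.linalg.matrix_norm(X.T @ X, ord=2)  # Lipschitz constant
        W_hat, Z, t = W.clone(), W.clone(), 1.0
        for _ in range(iters):
            grad = X.T @ (X @ Z - X @ W)              # gradient at momentum point
            W_next = soft_threshold(Z - grad / L, lam / L)
            t_next = (1 + (1 + 4 * t * t) ** 0.5) / 2
            Z = W_next + ((t - 1) / t_next) * (W_next - W_hat)
            W_hat, t = W_next, t_next
        return W_hat

    X = torch.randn(128, 64)                          # calibration inputs
    W = torch.randn(64, 64)
    W_sparse = fista_prune(X, W)
    print(f"sparsity: {(W_sparse == 0).float().mean():.2%}")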


DC Field | Value | Language
dc.contributor.advisor | Yuan, X | -
dc.contributor.advisor | Zang, W | -
dc.contributor.author | Zhao, Pengxiang | -
dc.contributor.author | 赵鹏翔 | -
dc.date.accessioned | 2025-12-11T06:41:46Z | -
dc.date.available | 2025-12-11T06:41:46Z | -
dc.date.issued | 2025 | -
dc.identifier.citation | Zhao, P. [赵鹏翔]. (2025). GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/367409 | -
dc.description.abstract | (identical to the Abstract above) | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Graphics processing units | -
dc.subject.lcsh | Mathematical optimization | -
dc.subject.lcsh | Cloud computing | -
dc.subject.lcsh | Natural language processing (Computer science) | -
dc.title | GPU-adaptive algorithms for large-scale optimization in cloud computing and large language models | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Mathematics | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2025 | -
dc.identifier.mmsid | 991045147149003414 | -
