postgraduate thesis: Some optimization algorithms for applications in cloud computing and large language models
| Title | Some optimization algorithms for applications in cloud computing and large language models |
|---|---|
| Authors | Hu, Hanyu (胡翰宇) |
| Advisors | Yuan, X |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Hu, H. [胡翰宇]. (2025). Some optimization algorithms for applications in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | This thesis develops novel optimization algorithms to address two critical challenges in modern information technologies: cost-efficient bandwidth allocation for cloud-based livestreaming services and efficient compression of large language models (LLMs) through unstructured, semi-structured, and structured pruning. Both problems are large-scale, combinatorial in nature, and bear significant practical implications. For bandwidth allocation, we propose a three-phase optimization framework to minimize costs under the industry-standard 95th percentile billing. Our approach integrates circling reduction techniques with clustering-guided reallocations and constraint restoration, achieving reductions of up to 86% in billed outflow bandwidth and 42% in billed back-to-source bandwidth compared to baseline methods, while respecting indivisible stream constraints and quality-of-service requirements. For LLM compression, we advance the field through two key contributions. First, we present FISTAPruner for unstructured and semi-structured pruning, which introduces a convex optimization framework leveraging the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) with intra-layer error correction and adaptive hyperparameter tuning. This approach achieves 50% sparsity while retaining 98.6% of zero-shot task performance on LLaMA-3-70B. Second, we formalize an interlinking operator property in transformers and develop rigorous mixed-integer optimization models for structured pruning. Two novel structured pruning methods, FASP and SPAP, are proposed to address this challenge. FASP combines efficient importance scoring with optimal weight reconstruction, enabling structured pruning with high efficiency while preserving model performance. SPAP employs penalty methods and alternating minimization, further enhancing pruned model performance through optimization-driven approaches. Our structured pruning methods demonstrate superior performance across seven LLM families, including OPT, LLaMA-1/2/3/3.1/3.2, and Qwen2.5, achieving hardware-independent inference speedups of 1.29× at 30% sparsity with proportional memory reductions. The proposed methods are rigorously evaluated on synthetic and real-world cloud datasets as well as a wide range of LLMs, demonstrating scalability, practical applicability and superior performance compared with state-of-the-art baselines. This thesis bridges gaps in resource allocation and model compression, providing tools to reduce operational costs in cloud computing and enable efficient deployment of LLMs in resource-constrained environments. |
| Degree | Doctor of Philosophy |
| Subject | Mathematical optimization; Cloud computing; Natural language processing (Computer science) |
| Dept/Program | Mathematics |
| Persistent Identifier | http://hdl.handle.net/10722/367422 |
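The 95th-percentile billing rule referenced in the abstract discards the top 5% of per-interval bandwidth samples in a billing cycle and charges for the highest remaining sample. A minimal sketch of that rule (nearest-rank convention; the function name and sample data are illustrative, not taken from the thesis):

```python
def billed_bandwidth(samples):
    """95th-percentile billing: sort the per-interval samples,
    drop the top 5%, and bill at the highest remaining value
    (nearest-rank convention)."""
    ordered = sorted(samples)
    k = max(0, int(len(ordered) * 0.95) - 1)  # index of the billed sample
    return ordered[k]

# 20 five-minute samples (Mbps): one short spike to 95 is free,
# because it falls inside the discarded top 5%.
peaks = [10, 12, 11, 95, 13, 12, 11, 10, 12, 11,
         13, 12, 11, 10, 12, 11, 13, 12, 11, 10]
print(billed_bandwidth(peaks))  # prints 13, not 95
```

This tolerance for short spikes is exactly what the thesis's three-phase framework exploits: reshaping which intervals carry the peaks changes the billed value without changing total traffic.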
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Yuan, X | - |
| dc.contributor.author | Hu, Hanyu | - |
| dc.contributor.author | 胡翰宇 | - |
| dc.date.accessioned | 2025-12-11T06:41:52Z | - |
| dc.date.available | 2025-12-11T06:41:52Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Hu, H. [胡翰宇]. (2025). Some optimization algorithms for applications in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/367422 | - |
| dc.description.abstract | This thesis develops novel optimization algorithms to address two critical challenges in modern information technologies: cost-efficient bandwidth allocation for cloud-based livestreaming services and efficient compression of large language models (LLMs) through unstructured, semi-structured, and structured pruning. Both problems are large-scale, combinatorial in nature, and bear significant practical implications. For bandwidth allocation, we propose a three-phase optimization framework to minimize costs under the industry-standard 95th percentile billing. Our approach integrates circling reduction techniques with clustering-guided reallocations and constraint restoration, achieving reductions of up to 86% in billed outflow bandwidth and 42% in billed back-to-source bandwidth compared to baseline methods, while respecting indivisible stream constraints and quality-of-service requirements. For LLM compression, we advance the field through two key contributions. First, we present FISTAPruner for unstructured and semi-structured pruning, which introduces a convex optimization framework leveraging the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) with intra-layer error correction and adaptive hyperparameter tuning. This approach achieves 50% sparsity while retaining 98.6% of zero-shot task performance on LLaMA-3-70B. Second, we formalize an interlinking operator property in transformers and develop rigorous mixed-integer optimization models for structured pruning. Two novel structured pruning methods, FASP and SPAP, are proposed to address this challenge. FASP combines efficient importance scoring with optimal weight reconstruction, enabling structured pruning with high efficiency while preserving model performance. SPAP employs penalty methods and alternating minimization, further enhancing pruned model performance through optimization-driven approaches. Our structured pruning methods demonstrate superior performance across seven LLM families, including OPT, LLaMA-1/2/3/3.1/3.2, and Qwen2.5, achieving hardware-independent inference speedups of 1.29× at 30% sparsity with proportional memory reductions. The proposed methods are rigorously evaluated on synthetic and real-world cloud datasets as well as a wide range of LLMs, demonstrating scalability, practical applicability and superior performance compared with state-of-the-art baselines. This thesis bridges gaps in resource allocation and model compression, providing tools to reduce operational costs in cloud computing and enable efficient deployment of LLMs in resource-constrained environments. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Mathematical optimization | - |
| dc.subject.lcsh | Cloud computing | - |
| dc.subject.lcsh | Natural language processing (Computer science) | - |
| dc.title | Some optimization algorithms for applications in cloud computing and large language models | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Mathematics | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045147148803414 | - |
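The FISTA machinery the abstract builds on can be illustrated with a generic l1-regularized least-squares sketch. This is not FISTAPruner's layer-wise objective, error-correction scheme, or hyperparameter tuning, only the core iteration (gradient step, soft-threshold prox, Nesterov momentum); all names are hypothetical:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t*||.||_1: shrink each entry toward zero by t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista_lasso(A, b, lam, steps=200):
    """Minimize 0.5*||A w - b||^2 + lam*||w||_1 with FISTA."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth gradient
    w = np.zeros(A.shape[1])
    y, t = w.copy(), 1.0
    for _ in range(steps):
        grad = A.T @ (A @ y - b)
        w_next = soft_threshold(y - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = w_next + ((t - 1.0) / t_next) * (w_next - w)  # momentum extrapolation
        w, t = w_next, t_next
    return w

# Soft-thresholding drives small coordinates exactly to zero, which is
# the mechanism that induces sparsity in pruned weights.
w = fista_lasso(np.eye(3), np.array([3.0, 0.1, -2.0]), lam=0.5)
print(w)  # approx [2.5, 0.0, -1.5]: the small entry is pruned to zero
```

The momentum sequence is what distinguishes FISTA from plain ISTA, improving the convergence rate from O(1/k) to O(1/k²) on this class of composite objectives.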
