postgraduate thesis: Some optimization algorithms for applications in cloud computing and large language models
| Title | Some optimization algorithms for applications in cloud computing and large language models |
|---|---|
| Authors | Hu, Hanyu (胡翰宇) |
| Advisors | Yuan, X |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Hu, H. [胡翰宇]. (2025). Some optimization algorithms for applications in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | This thesis develops novel optimization algorithms to address two critical challenges in modern information technologies: cost-efficient bandwidth allocation for cloud-based livestreaming services and efficient compression of large language models (LLMs) through unstructured, semi-structured, and structured pruning. Both problems are large-scale, combinatorial in nature, and bear significant practical implications. For bandwidth allocation, we propose a three-phase optimization framework to minimize costs under the industry-standard 95th percentile billing. Our approach integrates circling reduction techniques with clustering-guided reallocations and constraint restoration, achieving reductions of up to 86% in billed outflow bandwidth and 42% in billed back-to-source bandwidth compared to baseline methods, while respecting indivisible stream constraints and quality-of-service requirements. For LLM compression, we advance the field through two key contributions. First, we present FISTAPruner for unstructured and semi-structured pruning, which introduces a convex optimization framework leveraging the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) with intra-layer error correction and adaptive hyperparameter tuning. This approach achieves 50% sparsity while retaining 98.6% of zero-shot task performance on LLaMA-3-70B. Second, we formalize an interlinking operator property in transformers and develop rigorous mixed-integer optimization models for structured pruning. Two novel structured pruning methods, FASP and SPAP, are proposed to address this challenge. FASP combines efficient importance scoring with optimal weight reconstruction, enabling structured pruning with high efficiency while preserving model performance. SPAP employs penalty methods and alternating minimization, further enhancing pruned model performance through optimization-driven approaches. Our structured pruning methods demonstrate superior performance across seven LLM families, including OPT, LLaMA-1/2/3/3.1/3.2, and Qwen2.5, achieving hardware-independent inference speedups of 1.29× at 30% sparsity with proportional memory reductions. The proposed methods are rigorously evaluated on synthetic and real-world cloud datasets as well as a wide range of LLMs, demonstrating scalability, practical applicability and superior performance compared with state-of-the-art baselines. This thesis bridges gaps in resource allocation and model compression, providing tools to reduce operational costs in cloud computing and enable efficient deployment of LLMs in resource-constrained environments. |
| Degree | Doctor of Philosophy |
| Subject | Mathematical optimization; Cloud computing; Natural language processing (Computer science) |
| Dept/Program | Mathematics |
| Persistent Identifier | http://hdl.handle.net/10722/367422 |
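The 95th-percentile billing rule referenced in the abstract discards the top 5% of per-interval bandwidth samples in a billing cycle and charges for the highest remaining sample. A minimal sketch of that rule (nearest-rank convention; the function name and sample data are illustrative, not taken from the thesis):

```python
def billed_bandwidth(samples):
    """95th-percentile billing: sort the per-interval samples,
    drop the top 5%, and bill at the highest remaining value
    (nearest-rank convention)."""
    ordered = sorted(samples)
    k = max(0, int(len(ordered) * 0.95) - 1)  # index of the billed sample
    return ordered[k]

# 20 five-minute samples (Mbps): one short spike to 95 is free,
# because it falls inside the discarded top 5%.
peaks = [10, 12, 11, 95, 13, 12, 11, 10, 12, 11,
         13, 12, 11, 10, 12, 11, 13, 12, 11, 10]
print(billed_bandwidth(peaks))  # prints 13, not 95
```

This tolerance for short spikes is exactly what the thesis's three-phase framework exploits: reshaping which intervals carry the peaks changes the billed value without changing total traffic.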
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Yuan, X | - |
| dc.contributor.author | Hu, Hanyu | - |
| dc.contributor.author | 胡翰宇 | - |
| dc.date.accessioned | 2025-12-11T06:41:52Z | - |
| dc.date.available | 2025-12-11T06:41:52Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Hu, H. [胡翰宇]. (2025). Some optimization algorithms for applications in cloud computing and large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/367422 | - |
| dc.description.abstract | This thesis develops novel optimization algorithms to address two critical challenges in modern information technologies: cost-efficient bandwidth allocation for cloud-based livestreaming services and efficient compression of large language models (LLMs) through unstructured, semi-structured, and structured pruning. Both problems are large-scale, combinatorial in nature, and bear significant practical implications. For bandwidth allocation, we propose a three-phase optimization framework to minimize costs under the industry-standard 95th percentile billing. Our approach integrates circling reduction techniques with clustering-guided reallocations and constraint restoration, achieving reductions of up to 86% in billed outflow bandwidth and 42% in billed back-to-source bandwidth compared to baseline methods, while respecting indivisible stream constraints and quality-of-service requirements. For LLM compression, we advance the field through two key contributions. First, we present FISTAPruner for unstructured and semi-structured pruning, which introduces a convex optimization framework leveraging the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) with intra-layer error correction and adaptive hyperparameter tuning. This approach achieves 50% sparsity while retaining 98.6% of zero-shot task performance on LLaMA-3-70B. Second, we formalize an interlinking operator property in transformers and develop rigorous mixed-integer optimization models for structured pruning. Two novel structured pruning methods, FASP and SPAP, are proposed to address this challenge. FASP combines efficient importance scoring with optimal weight reconstruction, enabling structured pruning with high efficiency while preserving model performance. SPAP employs penalty methods and alternating minimization, further enhancing pruned model performance through optimization-driven approaches. Our structured pruning methods demonstrate superior performance across seven LLM families, including OPT, LLaMA-1/2/3/3.1/3.2, and Qwen2.5, achieving hardware-independent inference speedups of 1.29× at 30% sparsity with proportional memory reductions. The proposed methods are rigorously evaluated on synthetic and real-world cloud datasets as well as a wide range of LLMs, demonstrating scalability, practical applicability and superior performance compared with state-of-the-art baselines. This thesis bridges gaps in resource allocation and model compression, providing tools to reduce operational costs in cloud computing and enable efficient deployment of LLMs in resource-constrained environments. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Mathematical optimization | - |
| dc.subject.lcsh | Cloud computing | - |
| dc.subject.lcsh | Natural language processing (Computer science) | - |
| dc.title | Some optimization algorithms for applications in cloud computing and large language models | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Mathematics | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045147148803414 | - |
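The FISTA machinery the abstract builds on can be illustrated with a generic l1-regularized least-squares sketch. This is not FISTAPruner's layer-wise objective, error-correction scheme, or hyperparameter tuning, only the core iteration (gradient step, soft-threshold prox, Nesterov momentum); all names are hypothetical:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t*||.||_1: shrink each entry toward zero by t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista_lasso(A, b, lam, steps=200):
    """Minimize 0.5*||A w - b||^2 + lam*||w||_1 with FISTA."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth gradient
    w = np.zeros(A.shape[1])
    y, t = w.copy(), 1.0
    for _ in range(steps):
        grad = A.T @ (A @ y - b)
        w_next = soft_threshold(y - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = w_next + ((t - 1.0) / t_next) * (w_next - w)  # momentum extrapolation
        w, t = w_next, t_next
    return w

# Soft-thresholding drives small coordinates exactly to zero, which is
# the mechanism that induces sparsity in pruned weights.
w = fista_lasso(np.eye(3), np.array([3.0, 0.1, -2.0]), lam=0.5)
print(w)  # approx [2.5, 0.0, -1.5]: the small entry is pruned to zero
```

The momentum sequence is what distinguishes FISTA from plain ISTA, improving the convergence rate from O(1/k) to O(1/k²) on this class of composite objectives.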
