postgraduate thesis: Resource management in cloud computing : algorithm and system co-design
Title | Resource management in cloud computing : algorithm and system co-design |
---|---|
Authors | Han, Zhenhua (韩震华) |
Advisors | Lau, FCM |
Issue Date | 2019 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Han, Z. [韩震华]. (2019). Resource management in cloud computing : algorithm and system co-design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Cloud computing has become a standard technology for the rapid delivery of computing services supporting a wide range of applications. Resource management, i.e., how to dispatch and schedule networking and computing resources, plays a key role in improving the efficiency of resource provisioning. It faces two critical challenges: (1) uncertainty, including uncertain user demands and uncertain resource quality (e.g., software/hardware failures, interference), which makes resource planning difficult; and (2) diverse application requirements, which push cloud service providers to understand and exploit the unique requirements of different applications. These challenges are faced not only by cloud data centers but also by edge computing, a new paradigm that provides low-latency access to cloud resources for edge applications. Tackling these challenges calls for co-design of the algorithm and the underlying system, which is the main theme of this thesis.
In Part I, we start with online and approximation solutions to deal with the uncertainty issue. Online algorithms are powerful methods that operate without any knowledge of future demands. Regardless of what future demands arrive, an online algorithm guarantees its performance by bounding the ratio between its performance and that of the offline optimum. We propose three online and approximation solutions tailored to three applications. First, we propose OnDisc, which is O(1/ε)-competitive with (1+ε) speed augmentation for job scheduling in edge-clouds. Second, we present Camul, which is O(log K)-competitive for cache management in edge-clouds (K is the total number of cache slots). Third, we propose SPIN for scheduling Bulk-Synchronous-Parallel (BSP) jobs; it is robust to estimation errors in job execution time.
Although the online and approximation solutions can guarantee performance for any future demands, they come at the cost of average performance, since they conservatively optimize for the worst case. In cloud environments, applications may exhibit predictable future demands. In Part II, we leverage machine learning to enable resource managers to better predict future demands and thereby improve resource efficiency. We propose two online-learning-based solutions for two applications. First, we demonstrate the predictability of virtual machine resource usage and propose MadVM, which is based on approximate Markov Decision Processes. Second, we study the uplink user scheduling problem in Heterogeneous Cellular Networks (HetNets) and propose OLIUS, which makes scheduling decisions by adaptively learning the environment from scratch.
In Part III, we further build a system framework to schedule machine learning workloads. More specifically, we focus on managing multi-tenant deep learning clusters equipped with specialized accelerators (e.g., GPUs). Simply retrofitting traditional resource management solutions could lead to severe sharing anomalies due to the uncertain and non-uniform resource demands of different tenants. We propose HiveD, which guarantees a strict sharing safety condition so that users can behave as if they were using private clusters, without sacrificing the resource utilization of shared clusters.
|
Degree | Doctor of Philosophy |
Subject | Cloud computing |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/283124 |
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Lau, FCM | - |
dc.contributor.author | Han, Zhenhua | - |
dc.contributor.author | 韩震华 | - |
dc.date.accessioned | 2020-06-10T01:02:14Z | - |
dc.date.available | 2020-06-10T01:02:14Z | - |
dc.date.issued | 2019 | - |
dc.identifier.citation | Han, Z. [韩震华]. (2019). Resource management in cloud computing : algorithm and system co-design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/283124 | - |
dc.description.abstract | Cloud computing has become a standard technology for the rapid delivery of computing services supporting a wide range of applications. Resource management, i.e., how to dispatch and schedule networking and computing resources, plays a key role in improving the efficiency of resource provisioning. It faces two critical challenges: (1) uncertainty, including uncertain user demands and uncertain resource quality (e.g., software/hardware failures, interference), which makes resource planning difficult; and (2) diverse application requirements, which push cloud service providers to understand and exploit the unique requirements of different applications. These challenges are faced not only by cloud data centers but also by edge computing, a new paradigm that provides low-latency access to cloud resources for edge applications. Tackling these challenges calls for co-design of the algorithm and the underlying system, which is the main theme of this thesis. In Part I, we start with online and approximation solutions to deal with the uncertainty issue. Online algorithms are powerful methods that operate without any knowledge of future demands. Regardless of what future demands arrive, an online algorithm guarantees its performance by bounding the ratio between its performance and that of the offline optimum. We propose three online and approximation solutions tailored to three applications. First, we propose OnDisc, which is O(1/ε)-competitive with (1+ε) speed augmentation for job scheduling in edge-clouds. Second, we present Camul, which is O(log K)-competitive for cache management in edge-clouds (K is the total number of cache slots). Third, we propose SPIN for scheduling Bulk-Synchronous-Parallel (BSP) jobs; it is robust to estimation errors in job execution time.
Although the online and approximation solutions can guarantee performance for any future demands, they come at the cost of average performance, since they conservatively optimize for the worst case. In cloud environments, applications may exhibit predictable future demands. In Part II, we leverage machine learning to enable resource managers to better predict future demands and thereby improve resource efficiency. We propose two online-learning-based solutions for two applications. First, we demonstrate the predictability of virtual machine resource usage and propose MadVM, which is based on approximate Markov Decision Processes. Second, we study the uplink user scheduling problem in Heterogeneous Cellular Networks (HetNets) and propose OLIUS, which makes scheduling decisions by adaptively learning the environment from scratch. In Part III, we further build a system framework to schedule machine learning workloads. More specifically, we focus on managing multi-tenant deep learning clusters equipped with specialized accelerators (e.g., GPUs). Simply retrofitting traditional resource management solutions could lead to severe sharing anomalies due to the uncertain and non-uniform resource demands of different tenants. We propose HiveD, which guarantees a strict sharing safety condition so that users can behave as if they were using private clusters, without sacrificing the resource utilization of shared clusters. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Cloud computing | - |
dc.title | Resource management in cloud computing : algorithm and system co-design | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2020 | - |
dc.identifier.mmsid | 991044242097603414 | - |