Resource management in cloud computing : algorithm and system co-design

Han, Zhenhua; 韩震华

Appears in Collections:
- Computer Science: Theses
- HKU Theses Online

postgraduate thesis: Resource management in cloud computing : algorithm and system co-design

Title	Resource management in cloud computing : algorithm and system co-design
Authors	Han, Zhenhua 韩震华
Advisors	Lau, FCM
Issue Date	2019
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Han, Z. [韩震华]. (2019). Resource management in cloud computing : algorithm and system co-design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Cloud computing has become a standard technology for the rapid delivery of computing services supporting a wide range of applications. Resource management, i.e., how to dispatch and schedule networking and computing resources, plays a key role in improving the efficiency of resource provisioning. It faces two critical challenges: (1) uncertainty, including uncertain user demands and uncertain resource quality (e.g., software/hardware failures and interference), which makes resource planning difficult; and (2) diverse application requirements, which require cloud service providers to understand and exploit the unique requirements of different applications. These challenges are faced not only by cloud data centers but also by edge computing, a new paradigm that provides low-latency access to cloud resources for edge applications. Tackling these challenges calls for co-design of the algorithm and the underlying system, which is the main theme of this thesis. In Part I, we start with online and approximation solutions to deal with uncertainty. Online algorithms are powerful methods that operate without any knowledge of future demands. Regardless of which future demands arrive, they guarantee performance by bounding the ratio between their performance and the offline optimum. We propose three online and approximation solutions tailored to three applications. First, we propose OnDisc, which is O(1/ε)-competitive with (1+ε) speed augmentation for job scheduling in edge-clouds. Second, we present Camul, which is O(log K)-competitive for cache management in edge-clouds (K is the total number of cache slots). Third, we propose SPIN for scheduling Bulk-Synchronous-Parallel (BSP) jobs, which is robust to estimation errors in job execution time.
Although the online and approximation solutions can guarantee performance for any future demands, they come at a cost in average performance because they conservatively optimize for the worst case. In cloud environments, applications may exhibit predictable future demands. In Part II, we leverage machine learning to enable resource managers to better predict future demands and thereby improve resource efficiency. We propose two online-learning-based solutions for two applications. First, we demonstrate the predictability of virtual machine resource usage and propose MadVM, which is based on approximate Markov Decision Processes. Second, we study the uplink user scheduling problem in Heterogeneous Cellular Networks (HetNets) and propose OLIUS, which makes scheduling decisions by adaptively learning the environment from scratch. In Part III, we further build a system framework to schedule machine learning workloads. More specifically, we focus on managing multi-tenant deep learning clusters equipped with specialized accelerators (e.g., GPUs). Simply retrofitting traditional resource management solutions could lead to severe sharing anomalies due to the uncertain and non-uniform resource demands of different tenants. We propose HiveD, which guarantees a strict sharing safety condition so that users can behave as if they were using private clusters, without sacrificing the resource utilization of shared clusters.
Degree	Doctor of Philosophy
Subject	Cloud computing
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/283124

DC Field	Value	Language
dc.contributor.advisor	Lau, FCM	-
dc.contributor.author	Han, Zhenhua	-
dc.contributor.author	韩震华	-
dc.date.accessioned	2020-06-10T01:02:14Z	-
dc.date.available	2020-06-10T01:02:14Z	-
dc.date.issued	2019	-
dc.identifier.citation	Han, Z. [韩震华]. (2019). Resource management in cloud computing : algorithm and system co-design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/283124	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Cloud computing	-
dc.title	Resource management in cloud computing : algorithm and system co-design	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2020	-
dc.identifier.mmsid	991044242097603414	-
