
postgraduate thesis: Resource management in cloud computing : algorithm and system co-design

Title: Resource management in cloud computing : algorithm and system co-design
Authors: Han, Zhenhua (韩震华)
Advisor(s): Lau, FCM
Issue Date: 2019
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Han, Z. [韩震华]. (2019). Resource management in cloud computing : algorithm and system co-design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Cloud computing has become a standard technology for the rapid delivery of computing services supporting a wide range of applications. Resource management, i.e., how to dispatch and schedule networking and computing resources, plays a key role in the efficiency of resource provisioning. It faces two critical challenges: (1) uncertainty, including uncertain user demands and uncertain resource quality (e.g., software/hardware failures and interference), which makes resource planning difficult; and (2) diverse application requirements, which force cloud service providers to understand and exploit the unique requirements of different applications. These challenges are faced not only by cloud data centers but also by edge computing, a new paradigm that provides low-latency access to cloud resources for edge applications. Tackling these challenges calls for the co-design of algorithms and the underlying system, which is the main theme of this thesis. In Part I, we start with online and approximation solutions to deal with uncertainty. Online algorithms operate without any knowledge of future demands; whatever demands arrive, they guarantee performance by bounding the ratio between their performance and that of the offline optimum. We propose three online and approximation solutions tailored to three applications. First, we propose OnDisc, which is O(1/ε)-competitive with (1+ε) speed augmentation for job scheduling in edge clouds. Second, we present Camul, which is O(log K)-competitive for cache management in edge clouds, where K is the total number of cache slots. Third, we propose SPIN for scheduling Bulk-Synchronous-Parallel (BSP) jobs, which is robust to estimation errors in job execution time.
Although online and approximation solutions guarantee performance under any future demands, they sacrifice average-case performance because they conservatively optimize for the worst case. In cloud environments, however, applications may exhibit predictable future demands. In Part II, we leverage machine learning to help resource managers better predict future demands and thereby improve resource efficiency. We propose two online-learning-based solutions for two applications. First, we demonstrate the predictability of virtual machine resource usage and propose MadVM, which is based on approximate Markov Decision Processes. Second, we study the uplink user scheduling problem in Heterogeneous Cellular Networks (HetNets) and propose OLIUS, which makes scheduling decisions by adaptively learning the environment from scratch. In Part III, we further build a system framework to schedule machine learning workloads. More specifically, we focus on managing multi-tenant deep learning clusters equipped with specialized accelerators (e.g., GPUs). Simply retrofitting traditional resource management solutions could lead to severe sharing anomalies because of the uncertain and non-uniform resource demands of different tenants. We propose HiveD, which guarantees a strict sharing-safety condition so that users can behave as if they were using private clusters, without sacrificing the resource utilization of the shared cluster.
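The competitive guarantees quoted in Part I of the abstract can be read against the standard definition from online algorithms. The block below is a minimal sketch of that definition for a cost-minimization objective, in one common form that omits the additive constant some texts allow. The notation (σ for an arbitrary input sequence, ALG and OPT for the costs of the online algorithm and the offline optimum, subscripts for machine speed) is assumed for illustration and is not taken from the thesis; the specific cost functions used by OnDisc and Camul are not stated in this record.

```latex
% Assumed notation (illustrative, not from the thesis):
%   \sigma            : any input (request/job) sequence
%   \mathrm{ALG}_s    : cost of the online algorithm on machines of speed s
%   \mathrm{OPT}_s    : cost of the offline optimum on machines of speed s
%   (s = 1 when the subscript is omitted; for a randomized algorithm,
%    \mathrm{ALG} denotes expected cost)
\begin{align*}
  \text{$c$-competitive:}\quad
    & \mathrm{ALG}(\sigma) \le c \cdot \mathrm{OPT}(\sigma)
      && \forall \sigma \\
  \text{with $(1+\varepsilon)$ speed augmentation:}\quad
    & \mathrm{ALG}_{1+\varepsilon}(\sigma) \le O\!\left(\tfrac{1}{\varepsilon}\right) \cdot \mathrm{OPT}_{1}(\sigma)
      && \forall \sigma \\
  \text{$O(\log K)$-competitive, $K$ cache slots:}\quad
    & \mathrm{ALG}(\sigma) \le O(\log K) \cdot \mathrm{OPT}(\sigma)
      && \forall \sigma
\end{align*}
```

Read this way, the speed-augmented bound compares an online scheduler given slightly faster machines against an optimum at unit speed; as ε shrinks, the online scheduler gets less extra speed and the O(1/ε) factor grows, so the guarantee weakens.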
Degree: Doctor of Philosophy
Subject: Cloud computing
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/283124


DC Field | Value | Language
dc.contributor.advisor | Lau, FCM | -
dc.contributor.author | Han, Zhenhua | -
dc.contributor.author | 韩震华 | -
dc.date.accessioned | 2020-06-10T01:02:14Z | -
dc.date.available | 2020-06-10T01:02:14Z | -
dc.date.issued | 2019 | -
dc.identifier.citation | Han, Z. [韩震华]. (2019). Resource management in cloud computing : algorithm and system co-design. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/283124 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Cloud computing | -
dc.title | Resource management in cloud computing : algorithm and system co-design | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2020 | -
dc.identifier.mmsid | 991044242097603414 | -
