
Conference Paper: HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees

Title: HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
Authors: Zhao, H; HAN, Z; Yang, Z; Zhang, Q; Yang, F; Zhou, L; Yang, M; Lau, FCM; Wang, Y; Xiong, Y; Wang, B
Issue Date: 2020
Publisher: The USENIX Association
Citation: Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20), Banff, Alberta, Canada, 4-6 November 2020, p. 515-532
Abstract: Deep learning training on a shared GPU cluster is becoming a common practice. However, we observe a severe sharing anomaly in production multi-tenant clusters, where jobs in some tenants experience worse queuing delay than they would in a private cluster with their allocated share of GPUs. This is because tenants use quota, the number of GPUs, to reserve resources, whereas deep learning jobs often use GPUs with a desirable GPU affinity, which quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such anomalies never happen, by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures corresponding to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing scheduler within each VC to achieve its respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm to ensure sharing safety by efficiently managing the dynamic binding of cells from VCs to those in the physical cluster. A straightforward extension of buddy cell allocation further allows low-priority jobs to scavenge unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) the sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving a separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety.
Persistent Identifier: http://hdl.handle.net/10722/293458
ISBN: 9781939133199
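The buddy cell allocation described in the abstract can be pictured with a buddy-memory-style sketch. The snippet below is an illustrative toy, not HiveD's actual implementation: the class name, the cell encoding as (level, top-cell index, split path), and the doubling hierarchy are all assumptions made for the example. A requested small cell is carved out of a larger free cell by repeated splitting, and released cells are merged with their buddies to restore larger affinity cells.

```python
from collections import defaultdict


class BuddyCellAllocator:
    """Toy buddy-style cell allocator (illustrative only, not HiveD's code).

    Levels model GPU affinity: level 0 is a single GPU, and each higher
    level doubles the cell size (e.g., GPUs under one PCIe switch, then
    one node, and so on up to `top_level`).
    """

    def __init__(self, top_level, num_top_cells):
        # free_cells[k] holds the free cells at level k; a cell is encoded
        # as (level, top-cell index, split path from the top cell).
        self.free_cells = defaultdict(list)
        self.top_level = top_level
        for i in range(num_top_cells):
            self.free_cells[top_level].append((top_level, i, ()))

    def allocate(self, level):
        """Return a free cell at `level`, splitting a larger cell if needed."""
        if self.free_cells[level]:
            return self.free_cells[level].pop()
        if level == self.top_level:
            raise RuntimeError("no cell available at the requested level")
        parent = self.allocate(level + 1)        # may recursively split upward
        left, right = self._split(parent)
        self.free_cells[level].append(right)     # keep one buddy free
        return left

    def _split(self, cell):
        lvl, idx, path = cell
        return ((lvl - 1, idx, path + (0,)), (lvl - 1, idx, path + (1,)))

    def release(self, cell):
        """Free a cell, merging it with its buddy whenever both are free."""
        lvl, idx, path = cell
        while lvl < self.top_level:
            buddy = (lvl, idx, path[:-1] + (1 - path[-1],))
            if buddy in self.free_cells[lvl]:
                self.free_cells[lvl].remove(buddy)   # coalesce into the parent
                lvl, path = lvl + 1, path[:-1]
            else:
                break
        self.free_cells[lvl].append((lvl, idx, path))
```

For example, with one top-level cell of 4 GPUs (`top_level=2`), two single-GPU allocations split the hierarchy down to level 0; releasing both merges the buddies back into the original 4-GPU cell, preserving the large affinity cell for future multi-GPU jobs.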

 

DC Field: Value
dc.contributor.author: Zhao, H
dc.contributor.author: HAN, Z
dc.contributor.author: Yang, Z
dc.contributor.author: Zhang, Q
dc.contributor.author: Yang, F
dc.contributor.author: Zhou, L
dc.contributor.author: Yang, M
dc.contributor.author: Lau, FCM
dc.contributor.author: Wang, Y
dc.contributor.author: Xiong, Y
dc.contributor.author: Wang, B
dc.date.accessioned: 2020-11-23T08:17:03Z
dc.date.available: 2020-11-23T08:17:03Z
dc.date.issued: 2020
dc.identifier.citation: Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20), Banff, Alberta, Canada, 4-6 November 2020, p. 515-532
dc.identifier.isbn: 9781939133199
dc.identifier.uri: http://hdl.handle.net/10722/293458
dc.description.abstract: Deep learning training on a shared GPU cluster is becoming a common practice. However, we observe severe sharing anomaly in production multi-tenant clusters where jobs in some tenants experience worse queuing delay than they would have in a private cluster with their allocated shares of GPUs. This is because tenants use quota, the number of GPUs, to reserve resources, whereas deep learning jobs often use GPUs with a desirable GPU affinity, which quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such anomaly would never happen by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures corresponding to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing schedulers within each VC to achieve their respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm to ensure sharing safety by efficiently managing the dynamic binding of cells from VCs to those in a physical cluster. A straightforward extension of buddy cell allocation can further support low-priority jobs to scavenge the unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety.
dc.language: eng
dc.publisher: The USENIX Association.
dc.relation.ispartof: The 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20)
dc.title: HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
dc.type: Conference_Paper
dc.identifier.email: Lau, FCM: fcmlau@cs.hku.hk
dc.identifier.authority: Lau, FCM=rp00221
dc.identifier.hkuros: 319180
dc.identifier.spage: 515
dc.identifier.epage: 532
