Conference Paper: HAMS: High Availability for Distributed Machine Learning Service Graphs

Title: HAMS: High Availability for Distributed Machine Learning Service Graphs
Authors: Zhao, S; Chen, X; Wang, C; Li, F; Ji, Q; Cui, H; Li, C; Wang, S
Keywords: Machine Learning
Distributed System
Fault Tolerance
GPU
Nondeterminism
Issue Date: 2020
Publisher: IEEE. The Journal's web site is located at https://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000192
Citation: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Valencia, Spain, 29 June-2 July 2020, p. 184-196
Abstract: Mission-critical services often deploy multiple Machine Learning (ML) models in a distributed graph manner, where each model can be deployed on a distinct physical host. Practical fault tolerance for such ML service graphs should meet three crucial requirements: high availability (fast failover), low normal-case performance overhead, and global consistency under non-determinism (e.g., threads in a GPU can do floating-point additions in random order). Unfortunately, despite much effort, existing fault-tolerance systems, including those taking the primary-backup approach or the checkpoint-replay approach, cannot meet all three requirements. To tackle this problem, we present HAMS, which starts from the primary-backup approach to replicate each stateful ML model and leverages the causal logging technique from the checkpoint-replay approach to eliminate the notorious stop-and-buffer delay of the primary-backup approach. Extensive evaluation on 25 ML models and six ML services shows that: (1) in the normal case, HAMS achieved 0.5%-3.7% overhead on latency compared with bare metal; (2) HAMS took 116.12ms-254.19ms to recover one stateful model in all services, 155.1X-1067.9X faster than a relevant system, Lineage Stash (LS); and (3) HAMS recovered these services with global consistency even when GPU non-determinism exists, which is not supported by LS. HAMS's code is released on github.com/hku-systems/hams.
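The abstract describes the key mechanism only in prose, so the minimal Python sketch below illustrates the contrast it draws: a classic primary-backup replica must stop and buffer its output until the backup acknowledges the state update, whereas a causal-logging variant releases the output downstream immediately and records the non-deterministic ordering information (the "determinant") asynchronously. This is a generic illustration of the technique named in the abstract, not the HAMS implementation; the names Backup, run_model, and determinant are assumptions made for the example.

# Minimal sketch (not the actual HAMS code): contrasts the classic
# primary-backup "stop-and-buffer" pattern with the causal-logging idea
# described in the abstract. All class and function names are illustrative.
import queue
import threading

class Backup:
    """Holds a replica of the stateful model; records logged determinants."""
    def __init__(self):
        self.log = []                      # determinants needed to replay requests
    def ack_state_update(self, determinant):
        self.log.append(determinant)       # in a real system this is a remote call

def run_model(request):
    # Stand-in for a GPU inference step whose internal execution order
    # (e.g., floating-point addition order) may be non-deterministic.
    output = f"output({request})"
    determinant = f"order-info({request})" # info needed to replay identically
    return output, determinant

def stop_and_buffer(requests, backup, downstream):
    # Classic primary-backup: the primary must wait for the backup's
    # acknowledgement before releasing output downstream (adds latency).
    for req in requests:
        out, det = run_model(req)
        backup.ack_state_update(det)       # blocks until the backup acknowledges
        downstream.put(out)                # only now may the output leave

def causal_logging(requests, backup, downstream):
    # Causal-logging variant: send the output downstream immediately and
    # piggyback the determinant, logging it to the backup asynchronously,
    # so there is no stop-and-buffer delay on the critical path.
    for req in requests:
        out, det = run_model(req)
        downstream.put((out, det))         # output released right away
        threading.Thread(target=backup.ack_state_update, args=(det,)).start()

if __name__ == "__main__":
    down = queue.Queue()
    causal_logging(["req-1", "req-2"], Backup(), down)
    while not down.empty():
        print(down.get())

In causal-logging schemes generally, the determinant travels with (or alongside) the output message and is persisted off the critical path, which is what removes the stop-and-buffer delay while still allowing a consistent replay after a failure.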
Persistent Identifierhttp://hdl.handle.net/10722/289177
ISSN: 1530-0889
ISI Accession Number ID: WOS:000617924900016

 

DC Field | Value | Language
dc.contributor.author | Zhao, S | -
dc.contributor.author | Chen, X | -
dc.contributor.author | Wang, C | -
dc.contributor.author | Li, F | -
dc.contributor.author | Ji, Q | -
dc.contributor.author | Cui, H | -
dc.contributor.author | Li, C | -
dc.contributor.author | Wang, S | -
dc.date.accessioned | 2020-10-22T08:08:56Z | -
dc.date.available | 2020-10-22T08:08:56Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Valencia, Spain, 29 June-2 July 2020, p. 184-196 | -
dc.identifier.issn | 1530-0889 | -
dc.identifier.uri | http://hdl.handle.net/10722/289177 | -
dc.description.abstract | Mission-critical services often deploy multiple Machine Learning (ML) models in a distributed graph manner, where each model can be deployed on a distinct physical host. Practical fault tolerance for such ML service graphs should meet three crucial requirements: high availability (fast failover), low normal-case performance overhead, and global consistency under non-determinism (e.g., threads in a GPU can do floating-point additions in random order). Unfortunately, despite much effort, existing fault-tolerance systems, including those taking the primary-backup approach or the checkpoint-replay approach, cannot meet all three requirements. To tackle this problem, we present HAMS, which starts from the primary-backup approach to replicate each stateful ML model and leverages the causal logging technique from the checkpoint-replay approach to eliminate the notorious stop-and-buffer delay of the primary-backup approach. Extensive evaluation on 25 ML models and six ML services shows that: (1) in the normal case, HAMS achieved 0.5%-3.7% overhead on latency compared with bare metal; (2) HAMS took 116.12ms-254.19ms to recover one stateful model in all services, 155.1X-1067.9X faster than a relevant system, Lineage Stash (LS); and (3) HAMS recovered these services with global consistency even when GPU non-determinism exists, which is not supported by LS. HAMS's code is released on github.com/hku-systems/hams. | -
dc.language | eng | -
dc.publisher | IEEE. The Journal's web site is located at https://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000192 | -
dc.relation.ispartof | International Conference on Dependable Systems and Networks (DSN) Proceedings | -
dc.rights | International Conference on Dependable Systems and Networks (DSN) Proceedings. Copyright © IEEE. | -
dc.rights | ©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | -
dc.subject | Machine Learning | -
dc.subject | Distributed System | -
dc.subject | Fault Tolerance | -
dc.subject | GPU | -
dc.subject | Nondeterminism | -
dc.title | HAMS: High Availability for Distributed Machine Learning Service Graphs | -
dc.type | Conference_Paper | -
dc.identifier.email | Cui, H: heming@cs.hku.hk | -
dc.identifier.authority | Cui, H=rp02008 | -
dc.description.nature | postprint | -
dc.identifier.doi | 10.1109/DSN48063.2020.00036 | -
dc.identifier.scopus | eid_2-s2.0-85090414319 | -
dc.identifier.hkuros | 317118 | -
dc.identifier.spage | 184 | -
dc.identifier.epage | 196 | -
dc.identifier.isi | WOS:000617924900016 | -
dc.publisher.place | United States | -
dc.identifier.issnl | 1530-0889 | -
