Conference Paper: HAMS: High Availability for Distributed Machine Learning Service Graphs

Title: HAMS: High Availability for Distributed Machine Learning Service Graphs
Authors: Zhao, S; Chen, X; Wang, C; Li, F; Ji, Q; Cui, H; Li, C; Wang, S
Keywords: Machine Learning
Distributed System
Fault Tolerance
GPU
Nondeterminism
Issue Date: 2020
Publisher: IEEE. The Journal's web site is located at https://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000192
Citation: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Valencia, Spain, 29 June-2 July 2020, p. 184-196
Abstract: Mission-critical services often deploy multiple Machine Learning (ML) models in a distributed graph manner, where each model can be deployed on a distinct physical host. Practical fault tolerance for such ML service graphs should meet three crucial requirements: high availability (fast failover), low normal-case performance overhead, and global consistency under non-determinism (e.g., threads in a GPU can do floating-point additions in random order). Unfortunately, despite much effort, existing fault-tolerance systems, including those taking the primary-backup approach or the checkpoint-replay approach, cannot meet all three requirements. To tackle this problem, we present HAMS, which starts from the primary-backup approach to replicate each stateful ML model and leverages the causal logging technique from the checkpoint-replay approach to eliminate the notorious stop-and-buffer delay of the primary-backup approach. Extensive evaluation on 25 ML models and six ML services shows that: (1) in the normal case, HAMS achieved 0.5%-3.7% overhead on latency compared with bare metal; (2) HAMS took 116.12ms-254.19ms to recover one stateful model in all services, 155.1X-1067.9X faster than a relevant system, Lineage Stash (LS); and (3) HAMS recovered these services with global consistency even when GPU non-determinism exists, which is not supported by LS. HAMS's code is released on github.com/hku-systems/hams.
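The abstract describes the key mechanism only in prose, so the minimal Python sketch below illustrates the contrast it draws: a classic primary-backup replica must stop and buffer its output until the backup acknowledges the state update, whereas a causal-logging variant releases the output downstream immediately and records the non-deterministic ordering information (the "determinant") asynchronously. This is a generic illustration of the technique named in the abstract, not the HAMS implementation; the names Backup, run_model, and determinant are assumptions made for the example.

# Minimal sketch (not the actual HAMS code): contrasts the classic
# primary-backup "stop-and-buffer" pattern with the causal-logging idea
# described in the abstract. All class and function names are illustrative.
import queue
import threading

class Backup:
    """Holds a replica of the stateful model; records logged determinants."""
    def __init__(self):
        self.log = []                      # determinants needed to replay requests
    def ack_state_update(self, determinant):
        self.log.append(determinant)       # in a real system this is a remote call

def run_model(request):
    # Stand-in for a GPU inference step whose internal execution order
    # (e.g., floating-point addition order) may be non-deterministic.
    output = f"output({request})"
    determinant = f"order-info({request})" # info needed to replay identically
    return output, determinant

def stop_and_buffer(requests, backup, downstream):
    # Classic primary-backup: the primary must wait for the backup's
    # acknowledgement before releasing output downstream (adds latency).
    for req in requests:
        out, det = run_model(req)
        backup.ack_state_update(det)       # blocks until the backup acknowledges
        downstream.put(out)                # only now may the output leave

def causal_logging(requests, backup, downstream):
    # Causal-logging variant: send the output downstream immediately and
    # piggyback the determinant, logging it to the backup asynchronously,
    # so there is no stop-and-buffer delay on the critical path.
    for req in requests:
        out, det = run_model(req)
        downstream.put((out, det))         # output released right away
        threading.Thread(target=backup.ack_state_update, args=(det,)).start()

if __name__ == "__main__":
    down = queue.Queue()
    causal_logging(["req-1", "req-2"], Backup(), down)
    while not down.empty():
        print(down.get())

In causal-logging schemes generally, the determinant travels with (or alongside) the output message and is persisted off the critical path, which is what removes the stop-and-buffer delay while still allowing a consistent replay after a failure.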
Persistent Identifierhttp://hdl.handle.net/10722/289177
ISSN: 1530-0889
ISI Accession Number ID: WOS:000617924900016

 

DC Field | Value | Language
dc.contributor.author | Zhao, S | -
dc.contributor.author | Chen, X | -
dc.contributor.author | Wang, C | -
dc.contributor.author | Li, F | -
dc.contributor.author | Ji, Q | -
dc.contributor.author | Cui, H | -
dc.contributor.author | Li, C | -
dc.contributor.author | Wang, S | -
dc.date.accessioned | 2020-10-22T08:08:56Z | -
dc.date.available | 2020-10-22T08:08:56Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Valencia, Spain, 29 June-2 July 2020, p. 184-196 | -
dc.identifier.issn | 1530-0889 | -
dc.identifier.uri | http://hdl.handle.net/10722/289177 | -
dc.description.abstract | Mission-critical services often deploy multiple Machine Learning (ML) models in a distributed graph manner, where each model can be deployed on a distinct physical host. Practical fault tolerance for such ML service graphs should meet three crucial requirements: high availability (fast failover), low normal-case performance overhead, and global consistency under non-determinism (e.g., threads in a GPU can do floating-point additions in random order). Unfortunately, despite much effort, existing fault-tolerance systems, including those taking the primary-backup approach or the checkpoint-replay approach, cannot meet all three requirements. To tackle this problem, we present HAMS, which starts from the primary-backup approach to replicate each stateful ML model and leverages the causal logging technique from the checkpoint-replay approach to eliminate the notorious stop-and-buffer delay of the primary-backup approach. Extensive evaluation on 25 ML models and six ML services shows that: (1) in the normal case, HAMS achieved 0.5%-3.7% overhead on latency compared with bare metal; (2) HAMS took 116.12ms-254.19ms to recover one stateful model in all services, 155.1X-1067.9X faster than a relevant system, Lineage Stash (LS); and (3) HAMS recovered these services with global consistency even when GPU non-determinism exists, which is not supported by LS. HAMS's code is released on github.com/hku-systems/hams. | -
dc.language | eng | -
dc.publisher | IEEE. The Journal's web site is located at https://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000192 | -
dc.relation.ispartof | International Conference on Dependable Systems and Networks (DSN) Proceedings | -
dc.rights | International Conference on Dependable Systems and Networks (DSN) Proceedings. Copyright © IEEE. | -
dc.rights | ©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | -
dc.subject | Machine Learning | -
dc.subject | Distributed System | -
dc.subject | Fault Tolerance | -
dc.subject | GPU | -
dc.subject | Nondeterminism | -
dc.title | HAMS: High Availability for Distributed Machine Learning Service Graphs | -
dc.type | Conference_Paper | -
dc.identifier.email | Cui, H: heming@cs.hku.hk | -
dc.identifier.authority | Cui, H=rp02008 | -
dc.description.nature | postprint | -
dc.identifier.doi | 10.1109/DSN48063.2020.00036 | -
dc.identifier.scopus | eid_2-s2.0-85090414319 | -
dc.identifier.hkuros | 317118 | -
dc.identifier.spage | 184 | -
dc.identifier.epage | 196 | -
dc.identifier.isi | WOS:000617924900016 | -
dc.publisher.place | United States | -
dc.identifier.issnl | 1530-0889 | -
