Optimization of cloud task processing with checkpoint-restart mechanism

Di, S; Robert, Y; Vivien, F; Kondo, D; Wang, CL; Cappello, F

Publisher Website: 10.1145/2503210.2503217
Scopus: eid_2-s2.0-84899679452
WOS: WOS:000345856900065

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Optimization of cloud task processing with checkpoint-restart mechanism

Title	Optimization of cloud task processing with checkpoint-restart mechanism
Authors	Di, S; Robert, Y; Vivien, F; Kondo, D; Wang, CL; Cappello, F
Keywords	Cloud Computing; Checkpoint-Restart Mechanism; Optimal Checkpointing Interval; Google; BLCR
Issue Date	2013
Publisher	Association for Computing Machinery (ACM).
Citation	The 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC13), Denver, CO, 17-21 November 2013. In Proceedings of SC13, 2013, article no. 64. DOI: http://dx.doi.org/10.1145/2503210.2503217
Abstract	In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young’s formula by 3-10 percent, reducing wallclock lengths by 50-100 seconds per job on average.
Persistent Identifier	http://hdl.handle.net/10722/191545
ISBN	978-1-4503-2378-9
ISI Accession Number ID	WOS:000345856900065
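
The abstract compares the paper's result against Young's formula, which this record does not state. As a point of reference only, the sketch below computes Young's classic first-order approximation of the optimal checkpoint interval, sqrt(2 * C * MTBF), and the number of checkpoints this implies for a task of a given length. It does not reproduce the paper's own formula or adaptive algorithm. The function names and the sample values (a 2-second checkpoint cost, a 1-hour mean time between failures, a 1000-second task) are hypothetical and chosen purely for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mtbf_s: mean time between failures, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(task_length_s: float, interval_s: float) -> int:
    """Checkpoints taken when one is written after every interval_s seconds
    of useful work; no checkpoint is needed once the task has finished."""
    if task_length_s <= 0 or interval_s <= 0:
        raise ValueError("task length and interval must be positive")
    return max(0, math.ceil(task_length_s / interval_s) - 1)


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper or the Google trace.
    interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=3600.0)
    count = checkpoint_count(task_length_s=1000.0, interval_s=interval)
    print(f"interval = {interval:.1f} s, checkpoints = {count}")
```

With these sample inputs the interval is 120 seconds, so a 1000-second task would write 8 checkpoints. Young's formula assumes exponentially distributed failures, whereas the abstract describes the paper's analysis as making no assumption on the failure distribution.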

DC Field	Value	Language
dc.contributor.author	Di, S	en_US
dc.contributor.author	Robert, Y	en_US
dc.contributor.author	Vivien, F	en_US
dc.contributor.author	Kondo, D	en_US
dc.contributor.author	Wang, CL	en_US
dc.contributor.author	Cappello, F	en_US
dc.date.accessioned	2013-10-15T07:10:15Z	-
dc.date.available	2013-10-15T07:10:15Z	-
dc.date.issued	2013	en_US
dc.identifier.citation	The 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC13), Denver, CO., 17-21 November 2013. In Proceedings of SC13, 2013, article no. 64	en_US
dc.identifier.isbn	978-1-4503-2378-9	-
dc.identifier.uri	http://hdl.handle.net/10722/191545	-
dc.description.abstract	In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young’s formula by 3-10 percent, reducing wallclock lengths by 50-100 seconds per job on average.	-
dc.language	eng	en_US
dc.publisher	Association for Computing Machinery (ACM).	-
dc.relation.ispartof	Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis	en_US
dc.subject	Cloud Computing	-
dc.subject	Checkpoint-Restart Mechanism	-
dc.subject	Optimal Checkpointing Interval	-
dc.subject	Google	-
dc.subject	BLCR	-
dc.title	Optimization of cloud task processing with checkpoint-restart mechanism	en_US
dc.type	Conference_Paper	en_US
dc.identifier.email	Wang, CL: clwang@cs.hku.hk	en_US
dc.identifier.authority	Wang, CL=rp00183	en_US
dc.description.nature	link_to_OA_fulltext	-
dc.identifier.doi	10.1145/2503210.2503217	-
dc.identifier.scopus	eid_2-s2.0-84899679452	-
dc.identifier.hkuros	225318	en_US
dc.identifier.isi	WOS:000345856900065	-
dc.publisher.place	United States	-
