Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1145/2503210.2503217
- Scopus: eid_2-s2.0-84899679452
- WOS: WOS:000345856900065
Conference Paper: Optimization of cloud task processing with checkpoint-restart mechanism
Title | Optimization of cloud task processing with checkpoint-restart mechanism |
---|---|
Authors | Di, S; Robert, Y; Vivien, F; Kondo, D; Wang, CL; Cappello, F |
Keywords | Cloud Computing; Checkpoint-Restart Mechanism; Optimal Checkpointing Interval; BLCR |
Issue Date | 2013 |
Publisher | Association for Computing Machinery (ACM). |
Citation | The 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC13), Denver, CO., 17-21 November 2013. In Proceedings of SC13, 2013, article no. 64 |
Abstract | In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young’s formula by 3-10 percent, reducing wallclock lengths by 50-100 seconds per job on average. |
Persistent Identifier | http://hdl.handle.net/10722/191545 |
ISBN | 978-1-4503-2378-9 |
ISI Accession Number ID | WOS:000345856900065 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Di, S | en_US |
dc.contributor.author | Robert, Y | en_US |
dc.contributor.author | Vivien, F | en_US |
dc.contributor.author | Kondo, D | en_US |
dc.contributor.author | Wang, CL | en_US |
dc.contributor.author | Cappello, F | en_US |
dc.date.accessioned | 2013-10-15T07:10:15Z | - |
dc.date.available | 2013-10-15T07:10:15Z | - |
dc.date.issued | 2013 | en_US |
dc.identifier.citation | The 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC13), Denver, CO., 17-21 November 2013. In Proceedings of SC13, 2013, article no. 64 | en_US |
dc.identifier.isbn | 978-1-4503-2378-9 | - |
dc.identifier.uri | http://hdl.handle.net/10722/191545 | - |
dc.description.abstract | In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young’s formula by 3-10 percent, reducing wallclock lengths by 50-100 seconds per job on average. | - |
dc.language | eng | en_US |
dc.publisher | Association for Computing Machinery (ACM). | - |
dc.relation.ispartof | Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis | en_US |
dc.subject | Cloud Computing | - |
dc.subject | Checkpoint-Restart Mechanism | - |
dc.subject | Optimal Checkpointing Interval | - |
dc.subject | BLCR | - |
dc.title | Optimization of cloud task processing with checkpoint-restart mechanism | en_US |
dc.type | Conference_Paper | en_US |
dc.identifier.email | Wang, CL: clwang@cs.hku.hk | en_US |
dc.identifier.authority | Wang, CL=rp00183 | en_US |
dc.description.nature | link_to_OA_fulltext | - |
dc.identifier.doi | 10.1145/2503210.2503217 | - |
dc.identifier.scopus | eid_2-s2.0-84899679452 | - |
dc.identifier.hkuros | 225318 | en_US |
dc.identifier.isi | WOS:000345856900065 | - |
dc.publisher.place | United States | - |
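
The abstract uses Young's formula as its baseline. For orientation, the sketch below computes Young's classical first-order approximation of the checkpoint interval, sqrt(2 × checkpoint cost × mean time between failures), and the number of checkpoints this implies for a task of a given length. It is not the paper's optimized formula, which this record does not reproduce. The function names, parameter names, and numbers are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed input).
    mtbf_s: mean time between failures, in seconds (assumed input).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoints_for_task(task_length_s: float, interval_s: float) -> int:
    """Checkpoints taken when checkpointing every interval_s of useful work.

    No checkpoint is counted at the very end of the task, since the task
    has already finished at that point.
    """
    if task_length_s <= 0 or interval_s <= 0:
        raise ValueError("task length and interval must be positive")
    return max(0, math.ceil(task_length_s / interval_s) - 1)


if __name__ == "__main__":
    # Illustrative values only: a 2 s checkpoint cost and a 100 s MTBF.
    interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=100.0)
    print(f"interval: {interval:.1f} s")
    print(f"checkpoints for a 100 s task: {checkpoints_for_task(100.0, interval)}")
```

Young's approximation assumes exponentially distributed failures, whereas the abstract states that the paper's analysis makes no assumption about the failure distribution, so this sketch should be read only as the comparison baseline.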