TY - JOUR
T1 - Energy-performance modeling of speculative checkpointing for exascale systems
AU - Alfian Amrizal, Muhammad
AU - Uno, Atsuya
AU - Sato, Yukinori
AU - Takizawa, Hiroyuki
AU - Kobayashi, Hiroaki
N1 - Funding Information:
This research is partially supported by JST CREST "An Evolutionary Approach to Construction of a Software Development Environment for Massively-Parallel Heterogeneous Systems" and Grant-in-Aid for Challenging Exploratory Research #26540049. The authors would like to thank Prof. Kobayashi and Prof. Egawa of Tohoku University for their meaningful discussions.
Publisher Copyright:
Copyright © 2017 The Institute of Electronics, Information and Communication Engineers.
PY - 2017/12
Y1 - 2017/12
N2 - Coordinated checkpointing is a widely used checkpoint/restart (CPR) protocol for fault tolerance in large-scale HPC systems. However, this protocol concentrates a massive amount of I/O into a short period, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that distributes checkpoints over time to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate its energy-performance characteristics when it is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to an 11% energy reduction at the cost of a relatively small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.
AB - Coordinated checkpointing is a widely used checkpoint/restart (CPR) protocol for fault tolerance in large-scale HPC systems. However, this protocol concentrates a massive amount of I/O into a short period, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that distributes checkpoints over time to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate its energy-performance characteristics when it is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to an 11% energy reduction at the cost of a relatively small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.
KW - Checkpoint/restart
KW - Coordinated checkpointing
KW - Energy consumption
KW - Exascale
KW - Execution time
KW - Performance model
KW - Speculative checkpointing
UR - http://www.scopus.com/inward/record.url?scp=85038383852&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85038383852&partnerID=8YFLogxK
U2 - 10.1587/transinf.2017PAP0002
DO - 10.1587/transinf.2017PAP0002
M3 - Article
AN - SCOPUS:85038383852
SN - 0916-8532
VL - E100D
SP - 2749
EP - 2760
JO - IEICE Transactions on Information and Systems
JF - IEICE Transactions on Information and Systems
IS - 12
ER -