TY - GEN
T1 - A Failure Prediction-Based Adaptive Checkpointing Method with Less Reliance on Temperature Monitoring for HPC Applications
AU - Amrizal, Muhammad Alfian
AU - Li, Pei
AU - Agung, Mulya
AU - Egawa, Ryusuke
AU - Takizawa, Hiroyuki
N1 - Funding Information:
This work is supported by Grant-in-Aid for Scientific Research (B) 16H02822 and 17H01706.
Publisher Copyright:
© 2018 IEEE.
PY - 2018/10/29
Y1 - 2018/10/29
N2 - Checkpointing with a constant checkpoint interval, the so-called constant checkpointing method, is commonly used in the HPC field and has been proven optimal for failures whose inter-arrival times are exponentially distributed. However, previous work has shown a high correlation between processor temperature and failure rate. By analyzing temperature monitoring results for a parallel application, we observed that the failure rate changes dynamically and that failure inter-arrival times do not follow an exponential distribution. In such a scenario, the constant checkpointing method is not optimal, and a method with an adaptive checkpoint interval, called an adaptive checkpointing method, is required to achieve high performance. To use the adaptive method, however, the processor temperature must be monitored constantly in order to decide when to checkpoint. In this paper, we propose an adaptive checkpointing method that relies less on temperature monitoring. The proposed method uses the timings of failures that have already occurred, called prior failures, to estimate the mean time to failure (MTTF) of the next failure, called the posterior failure. The timing of the posterior failure is predicted based on the characteristics of a truncated Weibull distribution. Simulation results show that the proposed method can reduce the total wasted time compared to the constant checkpointing method while requiring only a considerably short temperature monitoring period.
KW - Adaptive checkpointing
KW - Checkpoint interval optimization
KW - Checkpoint/restart
KW - Processor temperature
KW - Reliability
KW - Temperature monitoring
UR - http://www.scopus.com/inward/record.url?scp=85057267799&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057267799&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2018.00067
DO - 10.1109/CLUSTER.2018.00067
M3 - Conference contribution
AN - SCOPUS:85057267799
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 515
EP - 523
BT - Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Y2 - 10 September 2018 through 13 September 2018
ER -