A Failure Prediction-Based Adaptive Checkpointing Method with Less Reliance on Temperature Monitoring for HPC Applications

Muhammad Alfian Amrizal, Pei Li, Mulya Agung, Ryusuke Egawa, Hiroyuki Takizawa

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Checkpointing with a constant checkpoint interval, the so-called constant checkpointing method, is commonly used in the HPC field and has been proven optimal for failures whose inter-arrival times are exponentially distributed. However, previous work has shown a high correlation between processor temperature and failure rate. By analyzing the results of temperature monitoring on a parallel application, we observed that the failure rate changes dynamically and that failure inter-arrival times do not follow an exponential distribution. In such a scenario, the constant checkpointing method is no longer optimal, and a checkpointing method with an adaptive checkpoint interval, called an adaptive checkpointing method, is required to achieve high performance. To use the adaptive method, however, the processor temperature must be monitored constantly in order to decide when to checkpoint. In this paper, we propose an adaptive checkpointing method that relies less on temperature monitoring. The proposed method uses the timings of failures that have already occurred, called prior failures, to estimate the mean time to failure (MTTF) of the next failure, called the posterior failure. The timing of the posterior failure is predicted based on the characteristics of a truncated Weibull distribution. Simulation results show that, with a considerably short temperature monitoring period, the proposed method reduces the total wasted time compared with the constant checkpointing method.
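
The abstract does not give the paper's formulas, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that Weibull `shape` and `scale` parameters have already been fitted to the prior failure timings (the fitting step is omitted), treats "truncated" as conditioning on survival up to the elapsed time, and uses Young's first-order approximation, interval = sqrt(2 × checkpoint cost × MTTF), as a stand-in rule for turning the estimated MTTF into a checkpoint interval. All function and parameter names are hypothetical.

```python
import math


def conditional_survival(u, elapsed, shape, scale):
    """P(T > u | T > elapsed) for a Weibull(shape, scale) failure time T."""
    return math.exp(-((u / scale) ** shape - (elapsed / scale) ** shape))


def mean_residual_life(elapsed, shape, scale, steps=20000):
    """Expected remaining time to failure, E[T - elapsed | T > elapsed].

    Computed by trapezoidal integration of the conditional survival
    function from `elapsed` up to a point where it has decayed to exp(-40).
    """
    upper = scale * ((elapsed / scale) ** shape + 40.0) ** (1.0 / shape)
    h = (upper - elapsed) / steps
    total = 0.5 * (conditional_survival(elapsed, elapsed, shape, scale)
                   + conditional_survival(upper, elapsed, shape, scale))
    for i in range(1, steps):
        total += conditional_survival(elapsed + i * h, elapsed, shape, scale)
    return total * h


def adaptive_interval(checkpoint_cost, elapsed, shape, scale):
    """Checkpoint interval from Young's approximation (an assumption here),
    using the conditional MTTF instead of a constant one."""
    mttf = mean_residual_life(elapsed, shape, scale)
    return math.sqrt(2.0 * checkpoint_cost * mttf)


if __name__ == "__main__":
    # Hypothetical parameters: shape > 1 models a failure rate that rises over time.
    for elapsed in (0.0, 50.0, 100.0):
        print(elapsed, adaptive_interval(1.0, elapsed, shape=2.0, scale=100.0))
```

With `shape = 1` the Weibull reduces to the exponential distribution, the estimated MTTF no longer depends on the elapsed time, and the sketch falls back to a constant interval, which is consistent with the abstract's remark that constant checkpointing is optimal in the exponential case.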

Original language: English
Title of host publication: Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 515-523
Number of pages: 9
ISBN (Electronic): 9781538683194
Publication status: Published - 2018 Oct 29
Event: 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018 - Belfast, United Kingdom
Duration: 2018 Sept 10 to 2018 Sept 13

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2018-September
ISSN (Print): 1552-5244

Conference

Conference: 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Country/Territory: United Kingdom
City: Belfast
Period: 18/9/10 to 18/9/13

Keywords

  • Adaptive checkpointing
  • Checkpoint interval optimization
  • Checkpoint/restart
  • Processor temperature
  • Reliability
  • Temperature monitoring
