TY - GEN
T1 - An Application-Level Incremental Checkpointing Mechanism with Automatic Parameter Tuning
AU - Takizawa, Hiroyuki
AU - Amrizal, Muhammad Alfian
AU - Komatsu, Kazuhiko
AU - Egawa, Ryusuke
N1 - Funding Information:
This work was partially supported by JST CREST “An Evolutionary Approach to Construction of a Software Development Environment for Massively-Parallel Heterogeneous Systems,” DFG SPPEXA ExaFSA project, and Grant-in-Aid for Scientific Research (B) 16H02822.
Publisher Copyright:
© 2017 IEEE.
PY - 2018/4/23
Y1 - 2018/4/23
N2 - Although incremental checkpointing is an effective way of reducing the checkpointing overhead, it has been discussed mostly for system-level checkpointing. Since the whole memory space of a running application is saved in a checkpoint file, system-level checkpointing will be less practical for future-generation extreme-scale computing systems, in which the I/O operation is much more expensive than the computation, especially in terms of power consumption. In this work, hence, the idea of incremental checkpointing is applied to application-level checkpointing, in which programmers explicitly specify the simulation data to be saved into a checkpoint file so that only necessary data for resuming the simulation are saved. This work assumes that, in incremental checkpointing, a management region consisting of multiple memory pages is written to a checkpoint file only if any page in the management region has been updated since the last checkpointing. A management granularity is defined as the number of pages in a management region. A large granularity is likely to reduce the checkpointing overhead if a management region consists of only updated pages. However, if the granularity is too large, a management region will contain a lot of pages not updated since the last checkpointing, and thus incremental checkpointing cannot reduce the number of pages to be written into a checkpoint file. Therefore, this paper proposes an application-level incremental checkpointing mechanism with granularity autotuning for reducing the checkpointing overhead of a legacy simulation code.
KW - Application-level checkpointing
KW - incremental checkpointing
KW - parameter auto-tuning
UR - http://www.scopus.com/inward/record.url?scp=85050271999&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85050271999&partnerID=8YFLogxK
U2 - 10.1109/CANDAR.2017.96
DO - 10.1109/CANDAR.2017.96
M3 - Conference contribution
AN - SCOPUS:85050271999
T3 - Proceedings - 2017 5th International Symposium on Computing and Networking, CANDAR 2017
SP - 389
EP - 394
BT - Proceedings - 2017 5th International Symposium on Computing and Networking, CANDAR 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th International Symposium on Computing and Networking, CANDAR 2017
Y2 - 19 November 2017 through 22 November 2017
ER -