TY - GEN
T1 - Improving the scalability of transparent checkpointing for GPU computing systems
AU - Amrizal, Alfian
AU - Hirasawa, Shoichi
AU - Komatsu, Kazuhiko
AU - Takizawa, Hiroyuki
AU - Kobayashi, Hiroaki
PY - 2012
Y1 - 2012
N2 - As the number of nodes in a GPU computing system increases, checkpointing to a global file system becomes more time-consuming due to I/O bottlenecks and network congestion. To solve this problem, we propose a transparent and scalable checkpoint/restart mechanism for OpenCL applications, named Two-level CheCL. As its name implies, Two-level CheCL consists of two different checkpoint implementations, Local CheCL and Global CheCL. Local CheCL avoids checkpointing to the global file system by utilizing each node's local storage. Our experimental results show that Local CheCL can make the checkpointing process up to four times faster than a conventional checkpointing mechanism. We also implement Global CheCL, which utilizes a global file system, to ensure that a global checkpoint file is always available even in the case of a catastrophic failure. We discuss the performance of the proposed mechanism through an analysis with a two-level checkpoint model.
AB - As the number of nodes in a GPU computing system increases, checkpointing to a global file system becomes more time-consuming due to I/O bottlenecks and network congestion. To solve this problem, we propose a transparent and scalable checkpoint/restart mechanism for OpenCL applications, named Two-level CheCL. As its name implies, Two-level CheCL consists of two different checkpoint implementations, Local CheCL and Global CheCL. Local CheCL avoids checkpointing to the global file system by utilizing each node's local storage. Our experimental results show that Local CheCL can make the checkpointing process up to four times faster than a conventional checkpointing mechanism. We also implement Global CheCL, which utilizes a global file system, to ensure that a global checkpoint file is always available even in the case of a catastrophic failure. We discuss the performance of the proposed mechanism through an analysis with a two-level checkpoint model.
UR - http://www.scopus.com/inward/record.url?scp=84873980580&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84873980580&partnerID=8YFLogxK
U2 - 10.1109/TENCON.2012.6412343
DO - 10.1109/TENCON.2012.6412343
M3 - Conference contribution
AN - SCOPUS:84873980580
SN - 9781467348225
T3 - IEEE Region 10 Annual International Conference, Proceedings/TENCON
BT - IEEE TENCON 2012
T2 - 2012 IEEE Region 10 Conference: Sustainable Development Through Humanitarian Technology, TENCON 2012
Y2 - 19 November 2012 through 22 November 2012
ER -