TY - GEN
T1 - NVCR: A transparent checkpoint-restart library for NVIDIA CUDA
T2 - 25th IEEE International Parallel and Distributed Processing Symposium, Workshops and PhD Forum, IPDPSW 2011
AU - Nukada, Akira
AU - Takizawa, Hiroyuki
AU - Matsuoka, Satoshi
PY - 2011
Y1 - 2011
AB - Today, CUDA is the de facto standard programming framework for exploiting the computational power of graphics processing units (GPUs) to accelerate various kinds of applications. For efficient use of a large GPU-accelerated system, one important mechanism is checkpoint-restart, which can be used not only to improve fault tolerance but also to optimize node/slot allocation by suspending a job on one node and migrating it to another. Although several checkpoint-restart implementations have been developed so far, they either do not support CUDA applications or impose severe limitations on CUDA support. Hence, we present a checkpoint-restart library for CUDA that deletes all CUDA resources before checkpointing and restores them immediately afterwards. Each memory chunk must be restored at the same memory address; to this end, we propose a novel technique that replays memory-related API calls. The library supports both the CUDA runtime API and the CUDA driver API. Moreover, the library is transparent to applications; it is not necessary to recompile them for checkpointing. This paper demonstrates that the proposed library achieves checkpoint-restart of various applications at acceptable overheads, and that it also works for MPI applications such as HPL.
UR - http://www.scopus.com/inward/record.url?scp=83455166682&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83455166682&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2011.131
DO - 10.1109/IPDPS.2011.131
M3 - Conference contribution
AN - SCOPUS:83455166682
SN - 9780769543857
T3 - IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
SP - 104
EP - 113
BT - 2011 IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum, IPDPSW 2011
Y2 - 16 May 2011 through 20 May 2011
ER -