TY - GEN
T1 - A QA-Assisted Job Scheduler for Minimizing the Impact of Urgent Computing on HPC System Operation
AU - Ohmura, Tatsuyoshi
AU - Takahashi, Keichi
AU - Egawa, Ryusuke
AU - Takizawa, Hiroyuki
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - In recent years, large-scale natural disasters such as earthquakes, tsunamis, and storms have increased, raising the importance of disaster prevention and mitigation. Urgent computing, which uses High Performance Computing (HPC) systems for rapid simulations and prompt countermeasures, has therefore been studied extensively. To meet the deadlines of urgent jobs, job schedulers may provide features that give urgent jobs a higher execution priority than other jobs. As a result, urgent job execution can negatively affect the execution of lower-priority jobs and, hence, overall HPC system operation. The goal of this paper is to execute urgent jobs efficiently so that they meet their deadlines while minimizing the negative impact on HPC system operation. This paper assumes that some of the running jobs can be suspended (and resumed later) so that an urgent job can be executed immediately. It also assumes that a suspended job may be terminated if necessary to meet the deadline: if the total memory usage of the urgent and suspended jobs exceeds the memory capacity, the suspended job's intermediate computation results are lost. The proposed method employs quantum annealing or quantum-inspired annealing (QA) techniques to find a combination of jobs to suspend that minimizes the loss of computational results while meeting the deadlines of urgent jobs. The evaluation results show that the proposed method can select an appropriate combination of jobs to suspend and thereby minimize computational losses. The results also demonstrate that the advantage of the proposed method becomes more pronounced in practical situations where the power-saving feature of HPC systems is enabled.
KW - Job Scheduling
KW - Power Saving
KW - Quantum Annealing
KW - Urgent Computing
KW - Urgent Job
UR - http://www.scopus.com/inward/record.url?scp=85216873700&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85216873700&partnerID=8YFLogxK
U2 - 10.1109/CANDARW64572.2024.00039
DO - 10.1109/CANDARW64572.2024.00039
M3 - Conference contribution
AN - SCOPUS:85216873700
T3 - Proceedings - 2024 12th International Symposium on Computing and Networking Workshops, CANDARW 2024
SP - 197
EP - 203
BT - Proceedings - 2024 12th International Symposium on Computing and Networking Workshops, CANDARW 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 12th International Symposium on Computing and Networking Workshops, CANDARW 2024
Y2 - 26 November 2024 through 29 November 2024
ER -