TY - GEN
T1 - An automatic MPI process mapping method considering locality and memory congestion on NUMA systems
AU - Agung, Mulya
AU - Amrizal, Muhammad Alfian
AU - Egawa, Ryusuke
AU - Takizawa, Hiroyuki
N1 - Funding Information:
This research is partially supported by Grant-in-Aid for Scientific Research (B) #16H02822 and #17H01706. The first author is financially supported by the Data Science Program from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/10
Y1 - 2019/10
N2 - MPI process mapping is an important step in achieving scalable performance on non-uniform memory access (NUMA) systems. Conventional approaches have focused only on improving the locality of communication. However, related studies have shown that on modern NUMA systems, memory congestion can cause more severe performance degradation than poor locality, because the large number of processor cores in such systems can lead to heavy congestion on shared caches and memory controllers. To optimize the process mapping, it is necessary to determine the communication behavior of the MPI processes. Previous methods rely on offline profiling to analyze the communication behavior, which incurs a high overhead and is potentially time-consuming. In this paper, we propose a method that automatically performs MPI process mapping, adapting to the communication behavior while considering both locality and memory congestion. Our method works at runtime during the execution of an MPI application. It requires no modification to the application, no prior knowledge of the communication behavior, and no changes to the hardware or operating system. The proposed method has been evaluated with the NAS Parallel Benchmarks on a NUMA system. Experimental results show that it achieves performance close to that of an oracle-based mapping method with low overhead on application execution. The performance improvement is up to 27.4% (13.4% on average) compared with the default mapping of the MPI runtime system.
AB - MPI process mapping is an important step in achieving scalable performance on non-uniform memory access (NUMA) systems. Conventional approaches have focused only on improving the locality of communication. However, related studies have shown that on modern NUMA systems, memory congestion can cause more severe performance degradation than poor locality, because the large number of processor cores in such systems can lead to heavy congestion on shared caches and memory controllers. To optimize the process mapping, it is necessary to determine the communication behavior of the MPI processes. Previous methods rely on offline profiling to analyze the communication behavior, which incurs a high overhead and is potentially time-consuming. In this paper, we propose a method that automatically performs MPI process mapping, adapting to the communication behavior while considering both locality and memory congestion. Our method works at runtime during the execution of an MPI application. It requires no modification to the application, no prior knowledge of the communication behavior, and no changes to the hardware or operating system. The proposed method has been evaluated with the NAS Parallel Benchmarks on a NUMA system. Experimental results show that it achieves performance close to that of an oracle-based mapping method with low overhead on application execution. The performance improvement is up to 27.4% (13.4% on average) compared with the default mapping of the MPI runtime system.
KW - Congestion
KW - Locality
KW - MPI
KW - Multi-core
KW - NUMA
KW - Process mapping
UR - http://www.scopus.com/inward/record.url?scp=85076145689&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076145689&partnerID=8YFLogxK
U2 - 10.1109/MCSoC.2019.00010
DO - 10.1109/MCSoC.2019.00010
M3 - Conference contribution
AN - SCOPUS:85076145689
T3 - Proceedings - 2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2019
SP - 17
EP - 24
BT - Proceedings - 2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th IEEE International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2019
Y2 - 1 October 2019 through 4 October 2019
ER -