TY - GEN
T1 - A shared cache for a chip multi vector processor
AU - Musa, Akihiro
AU - Sato, Yoshiei
AU - Soga, Takashi
AU - Okabe, Koki
AU - Egawa, Ryusuke
AU - Takizawa, Hiroyuki
AU - Kobayashi, Hiroaki
PY - 2008
Y1 - 2008
N2 - This paper discusses the design of a chip multi vector processor (CMVP), focusing on the effects of an on-chip cache when the off-chip memory bandwidth is limited. As chip multiprocessors (CMPs) have become mainstream in commodity scalar processors, the CMP architecture will be adopted in the design of vector processors in the near future to harness the large number of transistors on a chip. To maintain high sustained performance in the execution of scientific and engineering applications, a vector processor (core) generally requires a ratio of memory bandwidth to arithmetic performance of at least 4 bytes/flop (B/FLOP). However, vector supercomputers have been encountering the memory wall problem due to limited pin bandwidth. Therefore, we propose an on-chip shared cache to maintain the effective memory bandwidth of a CMVP. We evaluate the performance of the CMVP, based on the NEC SX vector architecture, using real scientific applications. In particular, we examine the caching effect on sustained performance as the B/FLOP rate is decreased. The experimental results indicate that an 8 MB on-chip shared cache can improve the performance of a four-core CMVP by 15% to 40% compared with the same configuration without the cache, because the shared cache increases the cache hit rates of multiple threads. The shared cache also employs miss status handling registers, which have the potential to accelerate difference schemes in scientific and engineering applications. Moreover, we show that 2 B/FLOP is enough for the CMVP to achieve high scalability when the on-chip cache is employed.
AB - This paper discusses the design of a chip multi vector processor (CMVP), focusing on the effects of an on-chip cache when the off-chip memory bandwidth is limited. As chip multiprocessors (CMPs) have become mainstream in commodity scalar processors, the CMP architecture will be adopted in the design of vector processors in the near future to harness the large number of transistors on a chip. To maintain high sustained performance in the execution of scientific and engineering applications, a vector processor (core) generally requires a ratio of memory bandwidth to arithmetic performance of at least 4 bytes/flop (B/FLOP). However, vector supercomputers have been encountering the memory wall problem due to limited pin bandwidth. Therefore, we propose an on-chip shared cache to maintain the effective memory bandwidth of a CMVP. We evaluate the performance of the CMVP, based on the NEC SX vector architecture, using real scientific applications. In particular, we examine the caching effect on sustained performance as the B/FLOP rate is decreased. The experimental results indicate that an 8 MB on-chip shared cache can improve the performance of a four-core CMVP by 15% to 40% compared with the same configuration without the cache, because the shared cache increases the cache hit rates of multiple threads. The shared cache also employs miss status handling registers, which have the potential to accelerate difference schemes in scientific and engineering applications. Moreover, we show that 2 B/FLOP is enough for the CMVP to achieve high scalability when the on-chip cache is employed.
KW - chip multiprocessor
KW - memory system
KW - performance characterization
KW - scientific application
KW - vector cache
KW - vector processing
UR - http://www.scopus.com/inward/record.url?scp=77954455177&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954455177&partnerID=8YFLogxK
U2 - 10.1145/1509084.1509088
DO - 10.1145/1509084.1509088
M3 - Conference contribution
AN - SCOPUS:77954455177
SN - 9781605582436
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 24
EP - 29
BT - Proceedings of the 9th MEDEA Workshop on MEmory Performance
T2 - 9th MEDEA Workshop on MEmory Performance: DEaling with Applications, Systems and Architecture, MEDEA '08, Held in Conjunction with the PACT 2008 Conference
Y2 - 26 October 2008 through 26 October 2008
ER -