TY - JOUR
T1 - Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth
AU - Sano, Kentaro
AU - Hatsuda, Yoshiaki
AU - Yamamoto, Satoru
PY - 2014/3
Y1 - 2014/3
N2 - Stencil computation is one of the important kernels in scientific computations. However, sustained performance is limited owing to restriction on memory bandwidth, especially on multicore microprocessors and graphics processing units (GPUs) because of their small operational intensity. In this paper, we present a custom computing machine (CCM), called a scalable streaming-array (SSA), for high-performance stencil computations with multiple field-programmable gate arrays (FPGAs). We design SSA based on a domain-specific programmable concept, where CCMs are programmable with the minimum functionality required for an algorithm domain. We employ a deep pipelining approach over successive iterations to achieve linear scalability for multiple devices with a constant memory bandwidth. Prototype implementation using nine FPGAs demonstrates good agreement with a performance model, and achieves 260 and 236 GFlop/s for 2D and 3D Jacobi computation, which are 87.4 and 83.9 percent of the peak, respectively, with a memory bandwidth of only 2.0 GB/s. We also evaluate the performance of SSA for state-of-the-art FPGAs.
AB - Stencil computation is one of the important kernels in scientific computations. However, sustained performance is limited owing to restriction on memory bandwidth, especially on multicore microprocessors and graphics processing units (GPUs) because of their small operational intensity. In this paper, we present a custom computing machine (CCM), called a scalable streaming-array (SSA), for high-performance stencil computations with multiple field-programmable gate arrays (FPGAs). We design SSA based on a domain-specific programmable concept, where CCMs are programmable with the minimum functionality required for an algorithm domain. We employ a deep pipelining approach over successive iterations to achieve linear scalability for multiple devices with a constant memory bandwidth. Prototype implementation using nine FPGAs demonstrates good agreement with a performance model, and achieves 260 and 236 GFlop/s for 2D and 3D Jacobi computation, which are 87.4 and 83.9 percent of the peak, respectively, with a memory bandwidth of only 2.0 GB/s. We also evaluate the performance of SSA for state-of-the-art FPGAs.
KW - custom computing machine
KW - FPGA
KW - high-performance computation
KW - scalable streaming-array
KW - stencil computation
UR - http://www.scopus.com/inward/record.url?scp=84894533061&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84894533061&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2013.51
DO - 10.1109/TPDS.2013.51
M3 - Article
AN - SCOPUS:84894533061
SN - 1045-9219
VL - 25
SP - 695
EP - 705
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 3
M1 - 6470606
ER -