TY - GEN
T1 - Scalable streaming-array of simple soft-processors for stencil computations with constant memory-bandwidth
AU - Sano, Kentaro
AU - Hatsuda, Yoshiaki
AU - Yamamoto, Satoru
PY - 2011
Y1 - 2011
N2 - Stencil computation is one of the important kernels in scientific computations, however, the sustained performance is limited by memory bandwidth especially on multi-core microprocessors and GPGPUs due to its small operational intensity. In this paper, we propose a scalable streaming-array (SSA) of simple soft-processors for high-performance stencil computation on multiple FPGAs. The SSA architecture allows a multi-device system to have linear scalability of computing performance by deeply pipelining with a constant bandwidth of an external-memory. We present an array-structure of programmable cores optimized for stencil computations and formulate a performance model of pipelined execution on the array. For Jacobi computations, SSA implemented on nine Stratix III FPGAs with the memory bandwidth of only 2 GB/s achieves 260 GFlop/s, corresponding to 87.4% of its peak performance, at 1.3 GFlop/s/W. We demonstrate that SSA provides almost linear speedup for larger than medium-sized computation as expected by the performance model. These high utilization and scalability show a big potential of custom computing on reconfigurable devices as a power-efficient and high-performance computing platform.
AB - Stencil computation is one of the important kernels in scientific computations, however, the sustained performance is limited by memory bandwidth especially on multi-core microprocessors and GPGPUs due to its small operational intensity. In this paper, we propose a scalable streaming-array (SSA) of simple soft-processors for high-performance stencil computation on multiple FPGAs. The SSA architecture allows a multi-device system to have linear scalability of computing performance by deeply pipelining with a constant bandwidth of an external-memory. We present an array-structure of programmable cores optimized for stencil computations and formulate a performance model of pipelined execution on the array. For Jacobi computations, SSA implemented on nine Stratix III FPGAs with the memory bandwidth of only 2 GB/s achieves 260 GFlop/s, corresponding to 87.4% of its peak performance, at 1.3 GFlop/s/W. We demonstrate that SSA provides almost linear speedup for larger than medium-sized computation as expected by the performance model. These high utilization and scalability show a big potential of custom computing on reconfigurable devices as a power-efficient and high-performance computing platform.
KW - Stencil computation
KW - FPGA
KW - High-performance computation
KW - scalable streaming-array
UR - http://www.scopus.com/inward/record.url?scp=79958746229&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79958746229&partnerID=8YFLogxK
U2 - 10.1109/FCCM.2011.12
DO - 10.1109/FCCM.2011.12
M3 - Conference contribution
AN - SCOPUS:79958746229
SN - 9780769543017
T3 - Proceedings - IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2011
SP - 234
EP - 241
BT - Proceedings - IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2011
T2 - 19th IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2011
Y2 - 1 May 2011 through 3 May 2011
ER -