Title: Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs
Authors: Li, Shiqing
Keywords: Engineering::Computer science and engineering
Issue Date: 2021
Source: Li, S., Liu, D. & Liu, W. (2021). Optimized data reuse via reordering for sparse matrix-vector multiplication on FPGAs. 2021 IEEE/ACM International Conference on Computer Aided Design (ICCAD). https://dx.doi.org/10.1109/ICCAD51958.2021.9643453
Project: MOE2019-T2-1-071
Abstract: Sparse matrix-vector multiplication (SpMV) is of paramount importance in both scientific and engineering applications. The main workload of SpMV is multiplications between randomly distributed nonzero elements in sparse matrices and their corresponding vector elements. Due to the irregular access patterns of vector elements and limited memory bandwidth, CPUs and GPUs achieve lower computational throughput than the peak performance offered by FPGAs. An FPGA's large on-chip memory allows the input vector to be buffered on-chip, so the off-chip memory bandwidth is used only to transfer the nonzero elements' values, column indices, and row indices. Multiple nonzero elements are transmitted to the FPGA per cycle, and their corresponding vector elements are then accessed. However, typical on-chip block RAMs (BRAMs) in FPGAs have only two access ports. The mismatch between off-chip memory bandwidth and on-chip memory ports stalls the whole engine, resulting in inefficient utilization of off-chip memory bandwidth. In this work, we reorder the nonzero elements to optimize data reuse for SpMV on FPGAs. The key observation is that since a vector element can be reused by all nonzero elements sharing its column index, memory requests for those elements can be omitted by reusing the fetched data. Based on this observation, a novel compressed format is proposed that optimizes data reuse by reordering the matrix's nonzero elements. Further, to support the compressed format, we design a scalable hardware accelerator and implement it on the Xilinx UltraScale ZCU106 platform. We evaluate the proposed design with a set of matrices from the University of Florida sparse matrix collection. The experimental results show that the proposed design achieves an average 1.22x speedup over the state-of-the-art work.
URI: https://hdl.handle.net/10356/155570
ISBN: 9781665445078
DOI: 10.1109/ICCAD51958.2021.9643453
DOI (Related Dataset): 10.21979/N9/ATEYFB
Rights: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ICCAD51958.2021.9643453.
Fulltext Permission: open
Fulltext Availability: With Fulltext
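The abstract's key observation — that nonzero elements sharing a column index can reuse the same fetched vector element — can be illustrated with a minimal sketch. This is not the paper's actual compressed format or hardware design: the COO-style (row, col, val) layout, the function names, and the simple fetch counter are illustrative assumptions only.

```python
def count_vector_fetches(nonzeros):
    """Count vector-element fetches for a stream of (row, col, val) nonzeros.

    Models reuse of the most recently fetched vector element: a new fetch
    is issued only when the column index changes from the previous one.
    """
    fetches = 0
    last_col = None
    for _, col, _ in nonzeros:
        if col != last_col:
            fetches += 1
            last_col = col
    return fetches


def reorder_by_column(nonzeros):
    """Group nonzeros by column index so each fetched vector element is reused."""
    return sorted(nonzeros, key=lambda nz: nz[1])


# Example: a small sparse matrix in COO form, originally in row-major order.
coo = [(0, 0, 1.0), (0, 2, 2.0), (1, 0, 3.0), (2, 2, 4.0)]
print(count_vector_fetches(coo))                     # row-major order: 4 fetches
print(count_vector_fetches(reorder_by_column(coo)))  # after reordering: 2 fetches
```

Because each nonzero carries its row index, the partial products can still be accumulated into the correct output rows after reordering; the paper's accelerator presumably handles this bookkeeping in hardware.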
Appears in Collections: SCSE Conference Papers
Updated on May 19, 2022
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.