Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/169151
Title: An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs
Authors: Li, Shiqing
Huai, Shuo
Liu, Weichen
Keywords: Engineering::Computer science and engineering
Engineering::Computer science and engineering::Hardware
Issue Date: 2023
Source: Li, S., Huai, S. & Liu, W. (2023). An efficient gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs. IEEE Transactions On Computer-Aided Design of Integrated Circuits and Systems. https://dx.doi.org/10.1109/TCAD.2023.3281719
Project: MOE2019-T2-1-071 
NAP (M4082282/04INS000515C130) 
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 
Abstract: Sparse-matrix sparse-matrix multiplication (SpMM) is an important kernel in multiple areas, e.g., data analytics and machine learning. Due to the low on-chip memory requirement, the consistent data format, and the simplified control logic, the Gustavson’s algorithm is a promising backbone algorithm for SpMM on hardware accelerators. However, the off-chip memory traffic still limits the performance of the algorithm, especially on embedded FPGAs. Previous researchers optimize the Gustavson’s algorithm targeting high bandwidth memory-based architectures and their solutions cannot be directly applied to embedded FPGAs with traditional DDRs. In this work, we propose an efficient Gustavson-based sparse matrix-matrix multiplication accelerator on embedded FPGAs. The proposed design fully considers the feature of off-chip memory access on embedded FPGAs and the dataflow of the Gustavson’s algorithm. At first, we analyze the parallelism of the algorithm and propose to perform the algorithm with element-wise parallelism, which reduces the idle time of processing elements caused by synchronization. Further, we show a counter-intuitive example that the traditional cache leads to worse performance. Then, we propose a novel access pattern-aware cache scheme called SpCache, which provides quick responses to reduce bank conflicts caused by irregular memory accesses and combines streaming and caching to handle requests that access ordered elements of unpredictable length. Moreover, we propose to perform the merge on part of partial results, which removes some redundant merges in the naive implementation and has little postprocessing overhead. Finally, we conduct experiments on the Xilinx Zynq-UltraScale ZCU106 platform with a set of benchmarks from the SuiteSparse matrix collection. The experimental results show that the proposed design achieves an average 1.75x performance speedup compared to the baseline.
URI: https://hdl.handle.net/10356/169151
ISSN: 0278-0070
DOI: 10.1109/TCAD.2023.3281719
DOI (Related Dataset): 10.21979/N9/RKPHSM
Schools: School of Computer Science and Engineering 
Rights: © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/TCAD.2023.3281719.
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:SCSE Journal Articles

Files in This Item:
File Description SizeFormat 
TCAD3281719.pdf1.24 MBAdobe PDFThumbnail
View/Open

SCOPUSTM   
Citations 50

9
Updated on Mar 19, 2025

Page view(s)

184
Updated on Mar 23, 2025

Download(s) 10

413
Updated on Mar 23, 2025

Google ScholarTM

Check

Altmetric


Plumx

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.