Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/172603
Title: An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning
Authors: Li, Shiqing
Zhu, Shien
Luo, Xiangzhong
Luo, Tao
Liu, Weichen
Keywords: Engineering::Computer science and engineering
Engineering::Computer science and engineering::Hardware
Issue Date: 2023
Source: Li, S., Zhu, S., Luo, X., Luo, T. & Liu, W. (2023). An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 42-48. https://dx.doi.org/10.1109/FPL60245.2023.00014
Project: MOE2019-T2-1-071 
NAP (M4082282/04INS000515C130) 
Conference: 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)
Abstract: Long short-term memory (LSTM) networks have been widely used in natural language processing applications. Although over 80% of the weights can be pruned with little accuracy loss to reduce the memory requirement, the pruned model still cannot be buffered on-chip on small embedded FPGAs. Since the weights are stored in off-chip DDR memory, the performance of LSTM is bounded by the available memory bandwidth. However, current pruning strategies do not consider bandwidth utilization and thus lead to poor performance in this situation. In this work, we propose an efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. The key idea is that a data sequence can be compressed if its items can be represented as a linear function of their indices in the sequence. Inspired by this idea, we first propose a column-wise pruning strategy that removes all the column indices and around 75% of the row indices of the remaining weights. Based on this strategy, we design a dedicated compressed format to fill the bandwidth. Further, we propose a fully pipelined hardware accelerator that achieves workload balance and shortens the critical path. Finally, we train the LSTM model on the TIMIT dataset and implement the accelerator on the Xilinx PYNQ-Z1 platform. Experimental results show that our design achieves around 0.3% higher accuracy, a 2.18x performance speedup, and 1.96x better power efficiency compared to the state-of-the-art work.
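The compression idea in the abstract — storing a sequence compactly when its items are a linear function of their positions — can be sketched as follows. This is an illustrative helper written for this record, not the paper's actual compressed format; the function name `linear_compress` and the (start, stride) encoding are assumptions for the example.

```python
def linear_compress(indices):
    """Return (start, stride) if indices[k] == start + stride * k for all k,
    meaning the whole sequence can be stored as just two integers instead of
    one index per item; return None if no such linear function exists."""
    if not indices:
        return None
    if len(indices) == 1:
        return (indices[0], 0)
    start = indices[0]
    stride = indices[1] - indices[0]
    for k, value in enumerate(indices):
        # Every item must lie exactly on the line start + stride * k.
        if value != start + stride * k:
            return None
    return (start, stride)
```

For instance, the row indices [3, 5, 7, 9] compress to (3, 2), while an irregular sequence such as [1, 2, 4] cannot be represented this way. This mirrors why a pruning strategy that regularizes the surviving indices (as the column-wise strategy in the paper does) lets the index metadata be dropped almost entirely.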
URI: https://hdl.handle.net/10356/172603
ISBN: 9798350341515
DOI: 10.1109/FPL60245.2023.00014
DOI (Related Dataset): 10.21979/N9/MTHKVG
Schools: School of Computer Science and Engineering 
Rights: © 2023 IEEE. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1109/FPL60245.2023.00014.
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:SCSE Conference Papers

Files in This Item:
FPL2023.pdf (387.92 kB, Adobe PDF)
