Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/172603
Title: An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning
Authors: Li, Shiqing; Zhu, Shien; Luo, Xiangzhong; Luo, Tao; Liu, Weichen
Keywords: Engineering::Computer science and engineering; Engineering::Computer science and engineering::Hardware
Issue Date: 2023
Source: Li, S., Zhu, S., Luo, X., Luo, T. & Liu, W. (2023). An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 42-48. https://dx.doi.org/10.1109/FPL60245.2023.00014
Project: MOE2019-T2-1-071; NAP (M4082282/04INS000515C130)
Conference: 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)
Abstract: Long short-term memory (LSTM) networks have been widely used in natural language processing applications. Although over 80% of the weights can be pruned with little accuracy loss to reduce the memory requirement, the pruned model still cannot be buffered on-chip on small embedded FPGAs. Because the weights are stored in off-chip DDR, LSTM performance is bounded by the available memory bandwidth. However, current pruning strategies do not consider bandwidth utilization and thus yield poor performance in this setting. In this work, we propose an efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning. The key idea is that a sequence of data can be compressed if each item can be represented as a linear function of its index in the sequence. Inspired by this idea, we first propose a column-wise pruning strategy that removes all the column indices and around 75% of the row indices of the remaining weights. Based on this strategy, we design a dedicated compressed format that fills the available bandwidth. Further, we propose a fully pipelined hardware accelerator that achieves workload balance and shortens the critical path. Finally, we train the LSTM model on the TIMIT dataset and implement the accelerator on the Xilinx PYNQ-Z1 platform. The experimental results show that our design achieves around 0.3% accuracy improvement, a 2.18x performance speedup, and a 1.96x power efficiency improvement compared to the state-of-the-art work.
URI: https://hdl.handle.net/10356/172603
ISBN: 9798350341515
DOI: 10.1109/FPL60245.2023.00014
DOI (Related Dataset): 10.21979/N9/MTHKVG
Schools: School of Computer Science and Engineering
Rights: © 2023 IEEE. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1109/FPL60245.2023.00014.
Fulltext Permission: open
Fulltext Availability: With Fulltext
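The compression idea stated in the abstract can be illustrated with a short sketch: if the indices of the surviving (non-pruned) entries form a linear function of their position in the sequence, the whole index list collapses to two integers (slope and intercept) and need not be stored or transferred explicitly. The Python snippet below is a minimal illustration of that idea only, not the authors' compressed format or accelerator pipeline; the function names and example data are assumptions.

```python
# Minimal sketch of the index-compression idea: an index sequence is redundant
# when indices[k] == slope * k + intercept for every position k, so only the
# two parameters need to be kept. Hypothetical helper names, for illustration.

def compress_indices(indices):
    """Return (slope, intercept) if the indices are a linear function of their
    position in the sequence, otherwise None (indices must then be stored as-is)."""
    if not indices:
        return None
    if len(indices) == 1:
        return (0, indices[0])
    slope = indices[1] - indices[0]
    intercept = indices[0]
    for k, idx in enumerate(indices):
        if idx != slope * k + intercept:
            return None
    return (slope, intercept)

def decompress_indices(slope, intercept, length):
    """Regenerate the original index sequence from the two stored parameters."""
    return [slope * k + intercept for k in range(length)]

# Example: after pruning, the surviving row indices of one weight column happen
# to be evenly spaced, so the explicit index list is replaced by two integers.
rows = [3, 7, 11, 15, 19]
params = compress_indices(rows)                      # -> (4, 3)
assert decompress_indices(*params, len(rows)) == rows
```

Under this assumption, only the non-zero weight values plus a small, fixed number of parameters travel over the DDR interface, which is consistent with the abstract's claim that the dedicated compressed format is designed to fill the available memory bandwidth.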
Appears in Collections: | SCSE Conference Papers |
Files in This Item:
File | Description | Size | Format
---|---|---|---
FPL2023.pdf | | 387.92 kB | Adobe PDF