Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/163448
Title: Deep learning acceleration: from quantization to in-memory computing
Authors: Zhu, Shien
Keywords: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Hardware::Arithmetic and logic structures
Engineering::Computer science and engineering::Computer systems organization::Special-purpose and application-based systems
Issue Date: 2022
Publisher: Nanyang Technological University
Source: Zhu, S. (2022). Deep learning acceleration: from quantization to in-memory computing. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/163448
Project: MOE2019-T2-1-071 
MOE2019-T1-001-072 
M4082282 
M4082087 
Abstract: Deep learning has demonstrated high accuracy and efficiency in various applications. For example, Convolutional Neural Networks (CNNs), widely adopted in Computer Vision (CV), and Transformers, broadly applied in Natural Language Processing (NLP), are representative deep learning models. Deep learning models have grown deeper and larger in the past few years to obtain higher accuracy. Meanwhile, these models bring challenges to inference on the edge. These computation-intensive and memory-intensive models are not only bounded by limited computational resources but also suffer from the long latency and high energy cost of heavy memory access. Therefore, accelerating deep learning inference on the edge needs software/hardware co-optimization. From the software perspective, thanks to the fault-tolerant nature of deep learning models, quantizing the 32-bit values to low-bitwidth values effectively reduces the model size and the computational complexity. Ternary and binary neural networks are representative quantized networks that achieve 16-32X model size reduction and up to 64X theoretical speedup. However, the ternary and binary low-bitwidth storage schemes and arithmetic operations are inefficient on Central Processing Unit (CPU) and Graphics Processing Unit (GPU) platforms because of their encoding and dot-product designs. Existing ternary and binary encoding schemes are complex and incompatible with each other. In addition, current ternary and binary dot products contain redundant operations, and mixed-precision ternary and binary dot products are missing. Among various deep learning models, the Ternary Weight Network (TWN) and the Adder Neural Network (AdderNet) are two other promising neural networks with higher accuracy than ternary and binary neural networks.
Moreover, compared with integer quantization and full-precision models, TWN and AdderNet have a unique advantage: they replace the multiplication operations with lightweight addition and subtraction operations, which are favoured by In-Memory Computing (IMC) architectures. From the hardware perspective, IMC architectures compute inside the Non-Volatile Memory (NVM) arrays to reduce the data movement overhead. IMC architectures conduct addition and Boolean operations in parallel, which is well suited to accelerating addition-centric deep learning models like TWNs and AdderNet. However, the addition and subtraction operators and data mapping schemes for deep learning models on existing IMC designs are not fully optimized. In this thesis, we accelerate deep learning inference from both software and hardware perspectives. On the software side, we propose TAB to accelerate quantized ternary and binary deep learning models on the edge. First, we propose a unified value representation based on standard signed integer encoding. Second, we introduce a bitwidth-last data storage format to avoid the overhead of extracting the sign bit. Third, we propose ternary and binary bitwise dot products based on Gated-XOR, reducing the number of operations by 25% to 61% compared with State-Of-The-Art (SOTA) methods. Finally, we implement TAB on both CPU and GPU platforms as an open-source library with optimized bitwise kernels. Experimental results show that TAB's ternary and binary neural networks achieve up to 34.6X to 72.2X speedup over full-precision ones. Next, on the hardware side, we propose an in-memory accelerator, FAT, for TWNs with three contributions: a fast addition scheme that avoids the time overhead of carry propagation and writing back, a sparse addition control unit that exploits sparsity to skip operations on zero weights, and a combined-stationary data mapping that reduces data movement and increases parallelism across memory columns.
Compared with SOTA IMC accelerators, FAT achieves a 10.02X speedup and 12.19X better energy efficiency on networks with 80% average sparsity. Last, we propose another in-memory accelerator, iMAD, for AdderNet. First, we co-optimize in-memory subtraction and addition operators to reduce the latency, energy, and sensing circuit area. Second, we design an accelerator architecture for AdderNet with high parallelism based on the optimized operators. Third, we propose an IMC-friendly computation pipeline for AdderNet convolution at the algorithm level to further boost the performance. Evaluation results show that iMAD achieves a 3.25X speedup and 3.55X better energy efficiency compared with a SOTA in-memory accelerator. In summary, we accelerate deep learning models through software/hardware co-design. We propose a unified and optimized ternary and binary inference framework with unified encoding, optimized data storage, an efficient bitwise dot product, and a programming library on existing CPU and GPU platforms. We further propose two hardware accelerators for TWNs and AdderNet with optimized operators, architectures, algorithms, and data mapping schemes on emerging in-memory computing platforms. In the future, we will extend the in-memory computing architectures to accelerate other types of deep learning models, for example, Transformers. We will also research general-purpose in-memory computing by integrating lightweight RISC-V CPU cores with computational memory arrays.
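Illustrative sketch (not part of the thesis record): the abstract describes bitwise dot products for ternary and binary values and multiplication-free arithmetic for TWNs and AdderNet. The Python below shows generic textbook forms of these ideas under stated assumptions. The 2-bit signed encoding (+1 -> 01, 0 -> 00, -1 -> 11), the two-bit-plane layout, and all function names are assumptions made for illustration; the abstract does not specify TAB's Gated-XOR kernels, FAT's control logic, or iMAD's operators, so this should not be read as their implementation. The AdderNet response is assumed to be the negative L1 distance between input and weights.

```python
def pack(values):
    """Pack values from {-1, 0, +1} into two bit planes (hypothetical layout).

    Assumed encoding: +1 -> 01, 0 -> 00, -1 -> 11 (2-bit two's complement).
    Returns (sign_plane, nonzero_plane) as ints, with element i stored at bit i.
    """
    sign = nonzero = 0
    for i, v in enumerate(values):
        if v not in (-1, 0, 1):
            raise ValueError("values must be -1, 0, or +1")
        if v != 0:
            nonzero |= 1 << i
        if v < 0:
            sign |= 1 << i
    return sign, nonzero


def popcount(x):
    """Count the set bits of a non-negative int."""
    return bin(x).count("1")


def bitwise_dot(a, b):
    """Dot product of two packed vectors using AND, XOR, and popcount only.

    A position contributes only if both operands are nonzero; it contributes
    -1 when the signs differ and +1 otherwise. Binary vectors are the special
    case in which every nonzero bit is set.
    """
    a_sign, a_nz = a
    b_sign, b_nz = b
    active = a_nz & b_nz
    negative = (a_sign ^ b_sign) & active
    return popcount(active) - 2 * popcount(negative)


def ternary_weight_dot(x, w):
    """TWN-style dot product: add or subtract activations, skip zero weights."""
    acc = 0
    for xi, wi in zip(x, w):
        if wi == 0:
            continue  # sparsity: no work for zero weights
        acc += xi if wi > 0 else -xi
    return acc


def adder_response(x, w):
    """AdderNet-style response, assumed here to be the negative L1 distance."""
    return -sum(abs(xi - wi) for xi, wi in zip(x, w))


if __name__ == "__main__":
    a = [1, -1, 0, 1, -1]
    b = [1, 1, -1, 1, -1]
    print(bitwise_dot(pack(a), pack(b)))          # bitwise ternary dot product
    print(ternary_weight_dot([5, 7, -2, 3], [1, 0, -1, -1]))
    print(adder_response([1, 2, 3], [1, 0, 5]))
```

Each function avoids multiplication: the first uses only Boolean operations and bit counting, and the other two use only addition, subtraction, and comparison, which is the property the abstract identifies as favourable for in-memory computing.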
URI: https://hdl.handle.net/10356/163448
DOI: 10.32657/10356/163448
DOI (Related Dataset): 10.21979/N9/DYKUPV
10.21979/N9/RZ75BY
10.21979/N9/JNFW9P
10.21979/N9/XEH3D1
Schools: School of Computer Science and Engineering 
Organisations: Parallel and Distributed Computing Centre
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Theses

Files in This Item:
File: NTU_Thesis_Shien_Updated_on 2022-11-18 Signed.pdf
Size: 6.65 MB
Format: Adobe PDF


Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.