Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/168401
Title: Hardware and algorithm co-optimization for energy-efficient machine learning integrated circuits
Authors: Kim, Hyunjoon
Keywords: Engineering::Electrical and electronic engineering::Integrated circuits
Engineering::Electrical and electronic engineering::Semiconductors
Issue Date: 2022
Publisher: Nanyang Technological University
Source: Kim, H. (2022). Hardware and algorithm co-optimization for energy-efficient machine learning integrated circuits. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/168401
Project: NTU REF 2019-1528
Abstract: The future of computing faces a new challenge: the performance gains offered by technology scaling alone cannot address the shortage of processing capability caused by the exponential growth of data generation. The traditional von Neumann digital architecture struggles with highly data-intensive, massively parallel operations such as deep neural network (DNN) and machine learning applications. High-speed multi-stream processors address these computing challenges by supplying raw computing power. However, deploying embedded hardware in the edge environment remains challenging. More specifically, in the edge environment, where area and energy efficiency are heavily emphasized, the communication bottleneck inherent in the traditional (i.e., von Neumann) architecture leads to high energy consumption. To address this issue, we apply combined optimization techniques in VLSI circuit architecture and compute algorithms, reducing the energy and area cost of data movement between the memory and the ALU. Computing-in-memory (CIM) architecture naturally adopts conventional SRAM bitcell operation so that the array serves as both parameter storage and processing element. Despite its architectural advantages in efficiency, however, electrical issues associated with the memory bitcell array significantly degrade its applicability in commercial system hardware. The issues raised in the array implementation are analogous to those of traditional analog computing, such as variation-induced non-linearity, limited operable dynamic range on the shared bit-line, and A/D conversion overhead. This work addresses these issues by imposing a digital abstraction layer on analog signals. In addition, by adopting the digital computing paradigm, our work offers further design flexibility through technology/voltage/frequency scalability and compute precision reconfigurability.
We start the discussion by implementing computing-in-memory (CIM) architecture to first tackle the excessive energy consumption from data movement. Our first work is an SRAM-based CIM macro with pseudo-differential voltage-mode accumulators. The design used a binary neural network (BNN) as the target DNN benchmark, and the macro was able to map 64×128 1b weights in its CIM bitcell array. Design features included a reconfigurable 1-5b row-by-row ADC and a binary-search-based calibration scheme that rejects residual non-linearity. Despite its many features and design advantages, several concerns were raised: the proposed design could not fully address the variation-induced error, suffered from ADC overhead, and was only capable of handling low-precision parameters. The design achieved a maximum energy efficiency of 87 TOPS/W and an area efficiency of 3.97 TOPS/mm² using 1b weights in a 128×128 array designed in 65nm CMOS. The second work, Colonnade, attempted to address the issues in the first design. Colonnade is also an SRAM-based CIM macro, but it adopts the digital computing paradigm to avoid many analog-related problems. Colonnade is a digital CIM macro that has no data conversion overhead and is robust to variation and noise, while supporting a wide range of reconfigurable parameter precisions and DNN model architectures with a scalable 128×128 digital bitcell array. The drawback of this work was low memory density, caused by the high computing hardware redundancy that is inevitable when each bitcell has two additional computing blocks fused in. This design achieved a maximum energy efficiency of 117.3 TOPS/W and an area efficiency of 6.75 TOPS/mm² using 1b weights in a 128×128 digital bitcell array designed in 65nm CMOS. The third work, which implements near-memory (NM) computing, somewhat alleviated the density issue.
The proposed design used a custom-designed 7T SRAM bitcell to store each 1b weight, with sixteen such bitcells grouped as a column MAC. A bit-serial compute block is placed in each column MAC to realize the NM architecture. As a result, the digital SRAM-based NM macro offered five times higher memory density than Colonnade, effectively resolving the hardware redundancy. The third design achieved 315 to 1.23 TOPS/W of energy efficiency and 4.3 to 0.270 TOPS/mm² of area efficiency using 1b to 16b weights in a 20×256 array built in 65nm CMOS. We also see great potential in co-optimizing the design flow across the encoding scheme and the hardware design. The Tensor-Train Decomposition algorithm and quantization techniques provide significant parameter compression that can resolve the memory density issue not fully mitigated in the previous works. Through this ongoing research, we devised a test chip to run DNN inference with orders of magnitude fewer stored parameters without severe degradation in performance. This work is currently under test, and only a few preliminary results are available.
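As an illustration of the bit-serial multiply-accumulate idea mentioned in the abstract, the following is a minimal software sketch in Python. It is not the thesis's circuit or code. It assumes unsigned integer weights and activations, and the function name bit_serial_mac, the 16-entry vectors, and the 4b weight width are hypothetical choices made only for this example.

```python
def bit_serial_mac(weights, activations, weight_bits):
    """Dot product computed one weight bit-plane at a time.

    Assumption for this sketch: weights are unsigned integers that fit in
    `weight_bits` bits, and activations are non-negative integers.
    """
    acc = 0
    for bit in range(weight_bits):
        # Each stored weight bit gates its activation; sum the gated values.
        partial = sum(a for w, a in zip(weights, activations) if (w >> bit) & 1)
        # Weight the partial sum by the significance of this bit-plane.
        acc += partial << bit
    return acc


# Hypothetical 16-element example, loosely mirroring a 16-entry column.
weights = [(3 * i + 1) % 16 for i in range(16)]      # 4b unsigned weights
activations = [(5 * i + 2) % 8 for i in range(16)]   # small unsigned inputs

result = bit_serial_mac(weights, activations, weight_bits=4)
reference = sum(w * a for w, a in zip(weights, activations))
```

In this sketch the same loop handles any weight width by changing weight_bits, which is one way to picture precision reconfigurability: a wider weight costs more bit-plane passes rather than different hardware.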
URI: https://hdl.handle.net/10356/168401
DOI: 10.32657/10356/168401
Schools: School of Electrical and Electronic Engineering 
Research Centres: VIRTUS, IC Design Centre of Excellence 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: EEE Theses

Files in This Item:
File: KimHyunjoon_Hardware_and_Algorithm_Co-optimization_for_energy-efficient_machine_learning_integrated_circuits.pdf
Description: Thesis Archive
Size: 3.06 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.