Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/178423
Title: Hardware-software co-exploration and optimization for next-generation learning machines
Authors: Chen, Chunyun
Keywords: Computer and Information Science
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Chen, C. (2024). Hardware-software co-exploration and optimization for next-generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423

Abstract:

In an era dominated by the rapid evolution of Machine Learning (ML), particularly Deep Learning (DL), the efficient deployment of learning algorithms on power- and area-constrained hardware remains a paramount challenge. The scaling of DL models to trillions of parameters and computation operations outstrips the modest gains in energy efficiency and memory density delivered by silicon scaling, making current DL hardware systems unsustainable. This thesis therefore delivers a comprehensive investigation into hardware-software co-design and optimization for next-generation learning machines, covering the co-design of both special-function hardware and end-to-end full-workload hardware, together with the system-level impacts of DL hardware accelerators. The key design metrics for next-generation learning machines are energy efficiency, performance, and area overhead.

To enable DL workloads to run on resource-constrained hardware platforms, reducing the memory footprint is essential. One way to achieve this is through efficient entropy coding. Commonly employed Fixed-to-Variable (F2V) entropy coding methods, e.g., Huffman coding and arithmetic coding, are hardware-unfriendly and cannot fully benefit from the reduced memory requirement. We adopt Tunstall coding, a Variable-to-Fixed (V2F) coding scheme, for DNN model compression and introduce two Tunstall decoders, the Logic-Oriented and the Memory-Oriented decoders, which achieve up to a 20× decrease in memory usage and a 100× reduction in energy consumption compared to 32-bit DNNs. Furthermore, these decoders process data 3× to 6× faster than F2V coding schemes.
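The hardware appeal of a V2F scheme is that every codeword has the same width, so decoding reduces to one table lookup per codeword instead of the bit-serial tree traversal that Huffman (F2V) decoding requires. The following minimal Python sketch illustrates this table-lookup decode; the 2-bit codeword width and the toy dictionary are illustrative assumptions, not the actual codebooks constructed in the thesis.

```python
# Minimal sketch of Variable-to-Fixed (Tunstall) decoding.
# Illustrative assumptions: 2-bit codewords and a toy dictionary;
# the thesis builds its codebooks for compressed DNN models.

CODEWORD_BITS = 2

# Each fixed-width codeword maps to a variable-length symbol string.
# The dictionary is prefix-free and complete, as a Tunstall tree
# guarantees, so any source sequence parses into these entries.
TUNSTALL_TABLE = {
    0b00: [0, 0, 0],  # deepest leaf: run of the most probable symbol
    0b01: [0, 0, 1],
    0b10: [0, 1],
    0b11: [1],
}

def tunstall_decode(bitstream: int, n_codewords: int) -> list[int]:
    """Decode n_codewords fixed-width codewords packed MSB-first.

    Every codeword is exactly CODEWORD_BITS wide, so each step is a
    constant-time lookup at a known bit offset; this regularity is
    what makes the decoder easy to pipeline or replicate in hardware.
    """
    symbols = []
    for i in range(n_codewords):
        shift = (n_codewords - 1 - i) * CODEWORD_BITS
        codeword = (bitstream >> shift) & ((1 << CODEWORD_BITS) - 1)
        symbols.extend(TUNSTALL_TABLE[codeword])
    return symbols

# Example: three codewords 00|10|11 decode to [0, 0, 0, 0, 1, 1].
print(tunstall_decode(0b001011, 3))
```

Because consecutive codewords sit at known bit offsets, several lookups can be issued in parallel, which is consistent with the 3× to 6× decoding speedup reported over F2V schemes; the Memory-Oriented and Logic-Oriented decoders can be read as two hardware realizations of this lookup, in on-chip memory and in combinational logic respectively, though the details are in the thesis itself.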
Apart from the convolutional layers in Convolutional Neural Networks (CNNs) and the Multi-Head Attention (MHA) in Transformers, DL workloads also contain non-linear operations that are not easily parallelizable, posing a challenge for hardware implementation. One of these, Non-Maximum Suppression (NMS), is a critical step in object detection frameworks and becomes a computational bottleneck when such frameworks are mapped onto hardware. Existing NMS optimizations do not parallelize effectively on ASIC platforms; the introduced ShapoolNMS overcomes this limitation. Combining low computational complexity with hardware/software co-optimization, ShapoolNMS is up to 42,713× faster than conventional GreedyNMS software implementations.
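For context, the GreedyNMS baseline repeatedly keeps the highest-scoring box and suppresses all remaining boxes that overlap it, performing O(n²) pairwise Intersection-over-Union (IoU) tests with a serial dependence between iterations. A minimal Python sketch of this baseline follows; the corner-format boxes and the 0.5 IoU threshold are illustrative assumptions.

```python
# Minimal sketch of the GreedyNMS baseline against which ShapoolNMS
# is compared. Boxes are (x1, y1, x2, y2) corners; the 0.5 IoU
# threshold is an illustrative assumption.

def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the best box, suppress its overlaps, repeat.

    Each kept box changes which candidates survive, so the outer
    loop is inherently sequential, and the inner filter performs
    O(n) IoU tests per iteration: O(n^2) work overall.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much
```

The sequential outer loop is exactly what resists parallelization on ASICs, since each iteration's candidate set depends on the previous one; this is the bottleneck that ShapoolNMS targets.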
This thesis also looks beyond a single layer of the DL workload and introduces two end-to-end accelerators, for entire CNN-based and Transformer-based DL workloads respectively. Current DL accelerators mainly target either the convolutional operations of CNNs or the MHA of Transformers; acceleration of the entire workload is less explored. This thesis introduces CNN-DLA, a Chiplet-based scalable hardware accelerator for CNN-based models, showcased on ResNet-152, and ViTA for the ViT workload. With the introduced cross-layer optimization dataflow, CNN-DLA reduces memory requirements by 84.85%, and the 44-Chiplet configuration achieves 68 FPS on ResNet-152 with full-HD input images. Similarly, ViTA reduces memory requirements by 40.5% and delivers adaptable performance of 0.20-16.38 TOPS, with area and power consumption of 2.00-6.79 mm² and 0.22-10.40 W respectively, making it suitable for diverse applications. Additionally, we provide detailed guidelines for integrating the introduced accelerators into a real hardware platform, the PULPissimo System-on-Chip (SoC), including the interfaces, register map, and finite state machine (FSM) of the integrated accelerators. Overall, this thesis provides a foundation for scalable DL accelerators and for the hardware-software co-design and co-exploration of learning machines. The introduced methods not only address current hardware limitations but also set a direction for sustainable and efficient DL hardware systems.

URI: https://hdl.handle.net/10356/178423
DOI: 10.32657/10356/178423
Schools: College of Computing and Data Science
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext

Appears in Collections: CCDS Theses
Files in This Item:
File | Description | Size | Format
---|---|---|---
PhD_Thesis_ChenChunyun-Final.pdf | Chen Chunyun PhD Thesis | 12.32 MB | Adobe PDF