Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/153778
Title: FPGA acceleration of continual learning at the edge
Authors: Piyasena Gane Pathirannahelage Duvindu
Keywords: Engineering::Computer science and engineering::Hardware
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Issue Date: 2021
Publisher: Nanyang Technological University
Source: Piyasena Gane Pathirannahelage Duvindu (2021). FPGA acceleration of continual learning at the edge. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/153778
Abstract: Edge AI systems are increasingly being adopted in a wide range of application domains. These systems typically deploy Convolutional Neural Network (CNN) models on edge devices to perform inference, while relying on the cloud for model training. This is due to the high computational and memory demands of conventional model training, which exceed the capabilities of resource-constrained edge devices running on tight power budgets. The dependency on the cloud for training is not suitable in many applications, where new objects or environmental conditions different from the ones present during training are frequently encountered. In such applications, continual learning of new knowledge on the edge device becomes a necessity to avoid performance bottlenecks due to round-trip communication delays, network connectivity, and the available bandwidth. In this thesis, we propose Field-Programmable Gate Array (FPGA) based accelerator architectures and optimization strategies for a new paradigm of machine learning algorithms that is capable of continual learning. The proposed methods will enable edge FPGA systems to perform on-device deep continual learning for object classification. Specifically, the proposed methods aim to achieve real-time learning on-device, while providing a high degree of scalability to learn a large number of classes. We first propose an FPGA accelerator for a Self-Organizing Neural Network (SONN) that can perform class-incremental continual learning in a streaming manner when combined with a CNN. The SONN model performs unsupervised learning from embedding features extracted from the CNN model by dynamically growing neurons and connections. We introduce design optimization strategies and runtime scheduling techniques to optimize resource usage, latency, and energy consumption. 
Experimental results on the CoRE50 dataset for continuous object recognition from video sequences demonstrate that the proposed FPGA architecture outperforms CPU- and GPU-based counterparts in terms of latency and power. However, the SONN model grows in proportion to the number of classes learnt, which limits its scalability to learn a large number of classes efficiently. Next, we propose an FPGA accelerator for a Streaming Linear Discriminant Analysis (SLDA) model to overcome the scalability limitations of SONN. Similar to the SONN, SLDA performs continual learning from embedding features extracted from a CNN in a streaming manner. SLDA is highly scalable for learning a large number of classes as the network does not grow dynamically to accommodate new knowledge. We propose several design and runtime optimizations to minimize resource usage, latency, and energy consumption. Additionally, we introduce a new variant of SLDA and discuss the accuracy-efficiency trade-offs using popular datasets for continual learning, CoRE50 and CUB200. The results demonstrate that the proposed SLDA accelerator outperforms CPU and GPU counterparts in terms of latency and energy efficiency. Finally, we demonstrate a full on-chip deep continual learning pipeline on FPGA by integrating the proposed SLDA accelerator with the Xilinx DPU, a programmable CNN accelerator IP. The design is implemented on a Xilinx Zynq UltraScale+ MPSoC. To overcome the large performance bottleneck caused by the communication overhead between the ARM processing system (PS) and programmable logic (PL), we implemented a Linux device driver that facilitates efficient memory mapping between the PS and PL. The experimental results on the CoRE50 dataset show that the proposed pipeline is capable of performing continual learning at nearly the same latency as the inference pipeline, with only a marginal increase in energy consumption. 
Our results clearly demonstrate the viability of deploying real-time deep continual learning on edge AI systems that are equipped with FPGA accelerators.
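Illustrative sketch (not taken from the thesis): the abstract notes that SLDA learns from one CNN embedding at a time and that its state does not grow with new knowledge. The pure-Python example below shows one commonly described formulation of streaming LDA, with running per-class means and a single shared covariance matrix. The class name `StreamingLDA`, the `shrinkage` default, the zero-initialised class means, and the toy 2-D "embeddings" are all assumptions made for illustration; the thesis's accelerator and its SLDA variant may differ.

```python
# Hypothetical software sketch of streaming LDA; not the thesis's implementation.

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


class StreamingLDA:
    def __init__(self, dim, shrinkage=1e-2):
        self.dim = dim
        self.shrinkage = shrinkage   # assumed regularisation strength
        self.means = {}              # class label -> running mean
        self.counts = {}             # class label -> samples seen
        self.cov = [[0.0] * dim for _ in range(dim)]  # shared, always dim x dim
        self.seen = 0

    def fit_one(self, z, label):
        """Update the model from a single embedding z with its label."""
        mean = self.means.get(label, [0.0] * self.dim)  # new class starts at zero
        count = self.counts.get(label, 0)
        diff = [zi - mi for zi, mi in zip(z, mean)]
        t = self.seen
        for i in range(self.dim):
            for j in range(self.dim):
                delta = t * diff[i] * diff[j] / (t + 1)
                self.cov[i][j] = (t * self.cov[i][j] + delta) / (t + 1)
        self.means[label] = [(count * mi + zi) / (count + 1)
                             for mi, zi in zip(mean, z)]
        self.counts[label] = count + 1
        self.seen += 1

    def predict(self, z):
        """Return the label with the highest linear discriminant score."""
        eps = self.shrinkage
        a = [[(1 - eps) * self.cov[i][j] + (eps if i == j else 0.0)
              for j in range(self.dim)] for i in range(self.dim)]
        best_label, best_score = None, float("-inf")
        for label, mean in self.means.items():
            w = solve(a, mean)  # w = regularised inverse covariance times mean
            score = (sum(wi * zi for wi, zi in zip(w, z))
                     - 0.5 * sum(wi * mi for wi, mi in zip(w, mean)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy stream of 2-D "embeddings" (made-up values), learnt one sample at a time.
model = StreamingLDA(dim=2)
stream = [([0.0, 0.0], "a"), ([0.2, -0.1], "a"), ([-0.1, 0.1], "a"),
          ([5.0, 5.0], "b"), ([5.1, 4.9], "b"), ([4.8, 5.2], "b")]
for z, label in stream:
    model.fit_one(z, label)
```

In this sketch, learning a further class adds only one mean vector, while the covariance matrix keeps its fixed size; that is the sense in which such a model does not grow dynamically.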
URI: https://hdl.handle.net/10356/153778
DOI: 10.32657/10356/153778
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: embargo_20231209
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Theses

Files in This Item:
File: Thesis_Duvindu_Meng_final.pdf
Description: Until 2023-12-09
Size: 4.55 MB
Format: Adobe PDF (under embargo until Dec 09, 2023)

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.