Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/183986
Title: Dynamic query routing for distributed edge chatbots on NVIDIA Jetson platforms
Authors: Ng, Mu Rong
Keywords: Computer and Information Science
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Ng, M. R. (2025). Dynamic query routing for distributed edge chatbots on NVIDIA Jetson platforms. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/183986
Abstract: In the current age of cloud-based computing and the widespread adoption of large language models (LLMs), concerns regarding data privacy, security and efficiency have become increasingly prominent. Many widely used LLM solutions rely on centralized cloud infrastructure, which is particularly concerning for industries that handle sensitive information, such as healthcare, finance, and government services, where strict data protection regulations apply. Deploying LLMs on local edge devices offers a compelling alternative, providing greater control over data while reducing reliance on external cloud services. However, running LLMs on edge devices presents challenges in computational efficiency, power consumption, and latency. A potential solution lies in distributed LLM systems, in which multiple edge devices work together to process queries according to their complexity, optimizing performance while maintaining security. This project explores a distributed LLM-based chatbot deployed across multiple NVIDIA Jetson devices. The system dynamically routes queries based on their complexity, ensuring that high-performance devices handle resource-intensive tasks while lightweight devices manage simpler queries. The goal is to strike a balance between latency and energy efficiency while maintaining local inference capabilities that prioritise privacy and security. The system uses the NVIDIA Jetson AGX Orin for high-performance tasks and the Jetson Nano Developer Kit for simpler tasks, operating in a client-server model in which each device acts as an independent server. Queries are assigned to the respective devices for processing based on a complexity threshold. Experiments confirmed the expected theoretical outcomes: lower latency but higher power usage on the Orin, and the opposite on the Nano. Further investigation and testing were conducted to determine the optimal threshold point that balances efficiency and performance for different use cases.
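The threshold-based routing described in the abstract can be sketched in a few lines. This is an illustrative example, not the author's implementation: the device names and the token-count complexity heuristic are assumptions made for the sketch.

```python
# Sketch of complexity-threshold query routing between two edge servers,
# assuming a token-count heuristic as the complexity measure (hypothetical;
# the report's actual scoring method may differ).

def complexity_score(query: str) -> int:
    """Crude complexity proxy: number of whitespace-separated tokens."""
    return len(query.split())

def route_query(query: str, threshold: int = 20) -> str:
    """Route to the high-performance server (Jetson AGX Orin) when the
    query's complexity exceeds the threshold; otherwise use the Nano."""
    if complexity_score(query) > threshold:
        return "jetson-agx-orin"  # resource-intensive queries
    return "jetson-nano"          # simple queries

# Usage: short queries go to the Nano, long ones to the Orin.
print(route_query("What time is it?"))        # -> jetson-nano
print(route_query(" ".join(["token"] * 50)))  # -> jetson-agx-orin
```

Tuning the threshold shifts the latency/energy trade-off, which is what the report's experiments investigate for different use cases.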
URI: https://hdl.handle.net/10356/183986
Schools: College of Computing and Data Science 
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: Ng Mu Rong_FYP Report_Final.pdf (Restricted Access)
Size: 4.42 MB
Format: Adobe PDF


Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.