Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/184140
Title: ML and systems co-design for resource-efficient LLM inference serving
Authors: Anand Chaanan Singh
Keywords: Computer and Information Science
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Anand Chaanan Singh (2025). ML and systems co-design for resource-efficient LLM inference serving. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/184140
Abstract: This project investigates the co-design of machine learning and systems-level techniques to enable resource-efficient inference serving of large language models (LLMs). We conduct a detailed analysis of the trade-offs between accuracy and performance (latency, memory, throughput) for a variety of model-level optimizations, including a modified version of Skeleton-of-Thought (SoT) prompting, prompt compression using LLMLingua, model quantization, and key-value (KV) cache quantization. Through systematic experimentation and analytical modeling, we identify efficient configurations that minimize resource usage while maintaining output quality close to that of large models. Beyond characterizing the efficiency-accuracy trade-offs of these model-level optimizations, the key contributions of this project include the implementation of system components such as parallel expansion handlers, compression routers, and output routers, designed to integrate the evaluated optimizations into an adaptive, modular serving system. These components are built on top of the vLLM engine to leverage modern serving capabilities such as continuous batching and paged attention. Together, our findings and implementations lay the foundation for a complete LLM serving stack that can dynamically adapt to input characteristics and system load by leveraging various machine learning optimizations. The full integration of these components into a complete system remains ongoing work.
URI: https://hdl.handle.net/10356/184140
Schools: College of Computing and Data Science 
Fulltext Permission: embargo_restricted_20270401
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: Anand Chaanan Singh - FYP Report.pdf
Description: Until 2027-04-01
Size: 1.51 MB
Format: Adobe PDF (under embargo until Apr 01, 2027)

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.