Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/184140
Title: | ML and systems co-design for resource-efficient LLM inference serving |
Authors: | Anand Chaanan Singh |
Keywords: | Computer and Information Science |
Issue Date: | 2025 |
Publisher: | Nanyang Technological University |
Source: | Anand Chaanan Singh (2025). ML and systems co-design for resource-efficient LLM inference serving. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/184140 |
Abstract: | This project investigates the co-design of machine learning and systems-level techniques to enable resource-efficient inference serving of large language models (LLMs). We conduct a detailed analysis of the trade-offs between accuracy and performance (latency, memory, throughput) for a variety of model-level optimizations, including a modified version of Skeleton-of-Thought (SoT) prompting, prompt compression using LLMLingua, model quantization, and key-value (KV) cache quantization. Through systematic experimentation and analytical modeling, we identify efficient configurations that minimize resource usage while maintaining output quality close to that of large models. Beyond characterizing the efficiency-accuracy trade-offs of these model-level optimizations, the key contributions of this project include the implementation of system components, such as parallel expansion handlers, compression routers, and output routers, designed to integrate the evaluated optimizations into an adaptive, modular serving system. These components are built on top of the vLLM engine to leverage modern serving capabilities like continuous batching and paged attention. Together, our findings and implementations lay the foundation for a complete LLM serving stack that can dynamically adapt to input characteristics and system load by leveraging various machine learning optimizations. The full integration of these components into a complete system remains ongoing work. |
URI: | https://hdl.handle.net/10356/184140 |
Schools: | College of Computing and Data Science |
Fulltext Permission: | embargo_restricted_20270401 |
Fulltext Availability: | With Fulltext |
Appears in Collections: | CCDS Student Reports (FYP/IA/PA/PI) |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Anand Chaanan Singh - FYP Report.pdf (until 2027-04-01) | | 1.51 MB | Adobe PDF | Under embargo until Apr 01, 2027 |
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.