Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/184482
Title: Investigating large language model pruning techniques
Authors: Cheng, Yixiao
Keywords: Computer and Information Science
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Cheng, Y. (2025). Investigating large language model pruning techniques. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/184482
Project: D-255-24251-07168 
Abstract: In recent years, the rapid development of large language models (LLMs) has significantly advanced the performance of a wide range of natural language processing (NLP) tasks. However, this progress has also introduced challenges related to the high cost of training, deployment, and model transfer. Model pruning is an effective optimization technique that reduces model size by eliminating a portion of the model parameters. Among the various pruning methods, structured block pruning is widely adopted for LLMs because of its pruning efficiency. Nevertheless, its typical approach of merely zeroing out weights limits the model's representational capacity and often leads to substantial performance degradation. To address this issue, we propose a novel optimization strategy that integrates structured block pruning with knowledge distillation under a sparse lazy-loading framework. Specifically, after structured block pruning, the remaining dense-format tensors are converted into sparse-format tensors in the Block Sparse Row (BSR) layout, effectively reducing the memory footprint required to store the pruned LLM. Subsequently, the pruned model, equipped with the sparse lazy-loading mechanism, is distilled using the original, unpruned model as the teacher, with the aim of recovering its language modeling and natural language reasoning capabilities. Empirical results show that although the distilled sparse lazy-loaded structured block-pruned model attains a slightly lower compression ratio than its non-distilled counterpart, it still achieves a substantial size reduction relative to the original, unpruned model. Moreover, it outperforms the non-distilled pruned model on both language modeling and natural language inference tasks, approaching the performance of the original model.
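Code sketch (illustrative, not taken from the thesis): the snippet below shows, assuming PyTorch 2.x, how a weight matrix might be block-pruned by block L2 norm, how the surviving blocks can be stored in the Block Sparse Row (BSR) layout via Tensor.to_sparse_bsr, and how a temperature-softened KL-divergence distillation loss against a teacher's logits could be computed. The function names block_prune_to_bsr and distillation_loss, the block size, and the keep ratio are hypothetical choices for illustration only.

import torch
import torch.nn.functional as F

def block_prune_to_bsr(weight: torch.Tensor, block: int = 16, keep_ratio: float = 0.5):
    """Zero out the lowest-norm (block x block) tiles, then convert to BSR."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "shape must tile evenly"
    # L2 norm of every block: tiles[i, j] covers weight[i*block:(i+1)*block, j*block:(j+1)*block].
    tiles = weight.reshape(rows // block, block, cols // block, block)
    norms = tiles.permute(0, 2, 1, 3).reshape(rows // block, cols // block, -1).norm(dim=-1)
    # Keep only the highest-norm blocks; zero the rest.
    k = max(1, int(keep_ratio * norms.numel()))
    threshold = norms.flatten().topk(k).values.min()
    mask = (norms >= threshold).repeat_interleave(block, 0).repeat_interleave(block, 1)
    pruned = weight * mask
    # BSR stores only the surviving blocks plus row/column index arrays.
    return pruned.to_sparse_bsr((block, block))

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

if __name__ == "__main__":
    w = torch.randn(256, 256)
    w_bsr = block_prune_to_bsr(w, block=16, keep_ratio=0.5)
    dense_bytes = w.numel() * w.element_size()
    sparse_bytes = (w_bsr.values().numel() * w_bsr.values().element_size()
                    + w_bsr.crow_indices().numel() * w_bsr.crow_indices().element_size()
                    + w_bsr.col_indices().numel() * w_bsr.col_indices().element_size())
    print(f"dense: {dense_bytes} B, BSR: {sparse_bytes} B")
    loss = distillation_loss(torch.randn(4, 1000), torch.randn(4, 1000))
    print(f"distillation loss: {loss.item():.4f}")

Storing only the non-zero blocks together with their compressed row and column index arrays is what makes it possible, in principle, to load the pruned weights lazily and keep the memory footprint well below that of the dense original, in line with the strategy described in the abstract.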
URI: https://hdl.handle.net/10356/184482
Schools: School of Electrical and Electronic Engineering 
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections:EEE Theses

Files in This Item:
File: Investigating Large Language Model Pruning Techniques.pdf (Restricted Access)
Size: 2.82 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.