Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/184482
Title: Investigating large language model pruning techniques
Authors: Cheng, Yixiao
Keywords: Computer and Information Science
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Cheng, Y. (2025). Investigating large language model pruning techniques. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/184482
Project: D-255-24251-07168
Abstract: In recent years, the rapid development of large language models (LLMs) has significantly advanced the performance of a wide range of natural language processing (NLP) tasks. This progress, however, has also brought high costs in training, deployment, and model transfer. Model pruning is an effective optimization technique that reduces model size by eliminating a portion of the model parameters. Among the various pruning methods, structured block pruning is widely adopted for LLMs because of its pruning efficiency. Nevertheless, its typical approach of merely zeroing out weights limits the model's representational capacity and often leads to substantial performance degradation. To address this issue, we propose a novel optimization strategy that integrates structured block pruning with knowledge distillation under a sparse lazy-loading framework. Specifically, after structured block pruning, the remaining dense-format tensors are converted into sparse tensors in the Block Sparse Row (BSR) layout, reducing the memory footprint required to store the pruned LLM. The pruned model, equipped with the sparse lazy-loading mechanism, is then distilled using the original, unpruned model as the teacher in order to recover its language modeling and natural language reasoning capabilities. Empirical results show that although the distilled, sparse lazy-loaded, structured block-pruned model has a slightly lower compression ratio than its non-distilled counterpart, it is still compressed far more effectively than the original, unpruned model. Moreover, it outperforms the non-distilled pruned model on both language modeling and natural language inference tasks, approaching the performance of the original model.
URI: https://hdl.handle.net/10356/184482
Schools: School of Electrical and Electronic Engineering
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
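The full thesis is under restricted access, so the following is only a minimal sketch of the two steps named in the abstract, assuming a PyTorch implementation: magnitude-based structured block pruning followed by conversion to the Block Sparse Row (BSR) layout via `to_sparse_bsr`, and a standard logit-matching knowledge-distillation loss. The block size, keep ratio, temperature, and helper names (`prune_to_bsr`, `distillation_loss`) are illustrative assumptions, not details taken from the thesis.

```python
# A minimal sketch, assuming a PyTorch setting; all hyperparameters and
# helper names here are illustrative, not taken from the thesis.
import torch
import torch.nn.functional as F


def prune_to_bsr(weight: torch.Tensor, block: int = 16, keep_ratio: float = 0.5):
    """Zero out the lowest-magnitude (block x block) weight blocks, then store
    the result in Block Sparse Row (BSR) layout so zeroed blocks cost no memory."""
    rows, cols = weight.shape
    # Group the matrix into (rows//block, block, cols//block, block) tiles.
    blocks = weight.reshape(rows // block, block, cols // block, block)
    # L1 norm of each block as a simple importance score (assumed criterion).
    scores = blocks.abs().sum(dim=(1, 3))
    threshold = torch.quantile(scores.flatten(), 1.0 - keep_ratio)
    mask = (scores >= threshold).float()[:, None, :, None]
    pruned = (blocks * mask).reshape(rows, cols)
    # Dense -> sparse BSR conversion; only surviving blocks are materialised.
    return pruned.to_sparse_bsr(blocksize=(block, block))


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student distributions,
    the usual logit-matching form of knowledge distillation."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    w = torch.randn(128, 128)
    w_bsr = prune_to_bsr(w, block=16, keep_ratio=0.5)
    print(w_bsr.layout, w_bsr.values().shape)  # torch.sparse_bsr, (#kept blocks, 16, 16)

    student = torch.randn(4, 32000)  # hypothetical vocabulary size
    teacher = torch.randn(4, 32000)
    print(distillation_loss(student, teacher))
```

Storing only the surviving blocks in BSR form is what reduces the storage footprint; the lazy-loading mechanism described in the abstract would then, presumably, materialise these sparse tensors only when a layer is actually needed, though that detail is an assumption about the implementation.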
Appears in Collections: EEE Theses
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Investigating Large Language Model Pruning Techniques.pdf (Restricted Access) | | 2.82 MB | Adobe PDF | View/Open |
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.