Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/184487
Title: Unlearning harmful behaviors in language models: reproducing and mitigating jailbreak attacks
Authors: Yi, Zimeng
Keywords: Computer and Information Science; Mathematical Sciences
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Yi, Z. (2025). Unlearning harmful behaviors in language models: reproducing and mitigating jailbreak attacks. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/184487
Abstract: Recent advances in large language models (LLMs) have prompted growing attention to model alignment, which aims to prevent the generation of harmful or unethical content. However, alignment remains vulnerable to adversarial attacks. In this work, we first demonstrate that popular open-source models, Mistral-7B and Vicuna-7B, can be manipulated to bypass alignment through carefully crafted adversarial prompts. These results highlight the practical limitations of current alignment techniques and motivate the need for robust mitigation strategies. Focusing on one such strategy, we investigate machine unlearning: the process of removing specific undesired behaviors from a trained model without full retraining. Due to computational constraints, we shift our experiments to GPT-2 and apply fine-tuning-based unlearning to erase the model's susceptibility to previously successful adversarial attacks. We then re-evaluate the model's responses under the same attack prompts. Our results show that fine-tuning can effectively reduce harmful generations, suggesting that targeted unlearning is a viable and lightweight defense mechanism. While the current experiments are conducted on GPT-2, future work should explore scalable unlearning techniques applicable to larger models such as Mistral-7B and Vicuna-7B.
URI: https://hdl.handle.net/10356/184487
Schools: School of Physical and Mathematical Sciences
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
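The abstract describes fine-tuning-based unlearning on GPT-2 followed by re-evaluation under the same attack prompts. The full text is restricted, so the exact objective and data are not visible here; the sketch below is only one plausible reading, assuming the unlearning step is supervised fine-tuning that maps previously successful attack prompts to refusal completions. The prompt/refusal strings, learning rate, and epoch count are illustrative placeholders, not the thesis's actual setup.

```python
# Minimal sketch of fine-tuning-based unlearning on GPT-2 (assumed setup):
# fine-tune the model so that previously successful attack prompts now yield
# refusals, then re-generate under the same prompts to re-evaluate.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical attack prompts that previously elicited harmful output,
# each paired with the safe refusal we want the model to produce instead.
pairs = [
    ("<adversarial prompt 1>", "I can't help with that request."),
    ("<adversarial prompt 2>", "I can't help with that request."),
]

def encode(prompt, target):
    # Supervise only the refusal tokens; mask the prompt positions with -100
    # so the loss ignores them.
    p = tokenizer(prompt, return_tensors="pt").input_ids[0]
    t = tokenizer(target + tokenizer.eos_token, return_tensors="pt").input_ids[0]
    input_ids = torch.cat([p, t])
    labels = torch.cat([torch.full_like(p, -100), t.clone()])
    return input_ids, labels

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):  # illustrative epoch count
    for prompt, target in pairs:
        input_ids, labels = encode(prompt, target)
        loss = model(input_ids=input_ids.unsqueeze(0),
                     labels=labels.unsqueeze(0)).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Re-evaluation: generate under the same attack prompts and inspect whether
# the harmful continuations have been replaced by refusals.
model.eval()
for prompt, _ in pairs:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    gen = model.generate(ids, max_new_tokens=40,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(gen[0][ids.shape[1]:], skip_special_tokens=True))
```

The same loop could be pointed at larger models such as Mistral-7B or Vicuna-7B, which is the scaling direction the abstract flags as future work.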
Appears in Collections: SPMS Student Reports (FYP/IA/PA/PI)
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| Unlearning Harmful Behaviors in LLMs.pdf (Restricted Access) | | 1.36 MB | Adobe PDF |
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.