Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/184487
Title: Unlearning harmful behaviors in language models: reproducing and mitigating jailbreak attacks
Authors: Yi, Zimeng
Keywords: Computer and Information Science
Mathematical Sciences
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Yi, Z. (2025). Unlearning harmful behaviors in language models: reproducing and mitigating jailbreak attacks. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/184487
Abstract: Recent advances in large language models (LLMs) have prompted growing attention to model alignment, aiming to prevent the generation of harmful or unethical content. However, alignment remains vulnerable to adversarial attacks. In this work, we first demonstrate that popular open-source models, Mistral-7B and Vicuna-7B, can be manipulated to bypass alignment through carefully crafted adversarial prompts. These results highlight the practical limitations of current alignment techniques and motivate the need for robust mitigation strategies. Focusing on one such strategy, we investigate machine unlearning, the process of removing specific undesired behaviors from a trained model without full retraining. Due to computational constraints, we shift our experiments to GPT-2 and apply fine-tuning-based unlearning to erase the model's susceptibility to previously successful adversarial attacks. We then re-evaluate the model's responses under the same attack prompts. Our results show that fine-tuning can effectively reduce harmful generations, suggesting that targeted unlearning is a viable and lightweight defense mechanism. While current experiments are conducted on GPT-2, future work should explore scalable unlearning techniques applicable to larger models such as Mistral-7B and Vicuna-7B.
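Note: The abstract does not specify the exact unlearning objective used in the report. As an illustrative sketch only, the snippet below shows one common form of fine-tuning-based unlearning on GPT-2: gradient ascent on the language-modeling loss of previously collected (attack prompt, harmful response) pairs, using the Hugging Face transformers library and PyTorch. The harmful_pairs list, the learning rate, and the step count are hypothetical placeholders, not values taken from the thesis.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Hypothetical stand-ins for (attack prompt, harmful response) pairs
    # collected during the jailbreak-reproduction experiments.
    harmful_pairs = [
        ("<adversarial prompt>", "<harmful model response>"),
    ]

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.train()

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    for step in range(3):  # a small number of unlearning passes
        for prompt, response in harmful_pairs:
            # GPT-2 uses a plain causal language-modeling loss, so the
            # labels are simply the input ids of the concatenated text.
            enc = tokenizer(prompt + " " + response, return_tensors="pt")
            out = model(input_ids=enc.input_ids, labels=enc.input_ids)
            # Gradient *ascent* on the harmful continuation: negating the
            # loss pushes the model's likelihood of this text down.
            (-out.loss).backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()

Replaying the same attack prompts through model.generate after this procedure, and checking whether harmful completions still appear, corresponds to the re-evaluation step described in the abstract.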
URI: https://hdl.handle.net/10356/184487
Schools: School of Physical and Mathematical Sciences 
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SPMS Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: Unlearning Harmful Behaviors in LLMs.pdf (Restricted Access)
Size: 1.36 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.