Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/184316
Title: Speak without leaks: a modular pipeline for data-level privacy-preserving utilization of large language models
Authors: Lee, Ci Hui
Keywords: Computer and Information Science
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Lee, C. H. (2025). Speak without leaks: a modular pipeline for data-level privacy-preserving utilization of large language models. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/184316
Abstract: The widespread adoption of Large Language Models (LLMs) across domains has raised significant concerns about data privacy, particularly when fine-tuning these models on domain-specific or user-generated content that may contain sensitive information. This project addresses the challenge of preserving privacy during LLM fine-tuning by proposing a modular, data-centric pipeline that applies privacy-preserving transformations to training data or prompts before model utilization. Unlike techniques that require changes to model architecture or training algorithms, the proposed pipeline operates independently of the underlying model, making it suitable for black-box scenarios where model internals are inaccessible. The pipeline integrates a suite of privacy-preserving methods, including classical anonymization, format-preserving encryption (FPE), and local differential privacy (LDP), to sanitize sensitive content at different levels. The implementation covers key phases such as entity identification, data sanitization, preprocessing, and model fine-tuning. Experiments conducted on benchmark text classification tasks demonstrate the trade-offs between privacy protection and model utility, with evaluation metrics highlighting the impact of different sanitization strategies. This work contributes a practical and extensible framework for privacy-aware LLM deployment, offering insights into how organizations can responsibly fine-tune language models on sensitive data, or query a third-party black-box model with sensitive prompts, without compromising compliance or exposing confidential information.
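To illustrate the local differential privacy (LDP) component mentioned in the abstract, the sketch below shows the classic randomized-response mechanism for a single sensitive bit. This is an illustrative example only, not code from the report: the function names (`randomized_response`, `estimate_mean`) and the choice of mechanism are assumptions, since the report's actual LDP scheme for text is not detailed here.

```python
import math
import random

def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """Release one sensitive bit under epsilon-LDP (Warner's randomized
    response): answer truthfully with probability e^eps / (e^eps + 1),
    otherwise report the flipped bit."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit

def estimate_mean(reports: list[int], epsilon: float) -> float:
    """Debias the noisy reports to estimate the true fraction of 1-bits.
    If p is the truth probability, E[observed] = p*mu + (1-p)*(1-mu),
    so mu = (observed - (1 - p)) / (2p - 1)."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(0)
true_bits = [1] * 300 + [0] * 700          # true population mean is 0.30
noisy = [randomized_response(b, 2.0, rng) for b in true_bits]
estimate = estimate_mean(noisy, 2.0)       # close to 0.30 despite the noise
```

The same privacy/utility trade-off the abstract describes appears here directly: smaller epsilon flips more bits (stronger privacy) but widens the error of the debiased estimate.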
URI: https://hdl.handle.net/10356/184316
Schools: College of Computing and Data Science 
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: CCDS24-0415_FYP_report_Lee Ci Hui.pdf (Restricted Access, 3.24 MB, Adobe PDF)



Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.