Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/184567
Title: JailGuard: a universal detection framework for prompt-based attacks on LLM systems
Authors: Zhang, Xiaoyu
Zhang, Cen
Li, Tianlin
Huang, Yihao
Jia, Xiaojun
Hu, Ming
Zhang, Jie
Liu, Yang
Ma, Shiqing
Shen, Chao
Keywords: Computer and Information Science
Issue Date: 2025
Source: Zhang, X., Zhang, C., Li, T., Huang, Y., Jia, X., Hu, M., Zhang, J., Liu, Y., Ma, S. & Shen, C. (2025). JailGuard: a universal detection framework for prompt-based attacks on LLM systems. ACM Transactions on Software Engineering and Methodology. https://dx.doi.org/10.1145/3724393
Project: NCRP25-P04-TAICeN
AISG2-GC-2023-008
Journal: ACM Transactions on Software Engineering and Methodology
Abstract: Systems and software powered by Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs) play a critical role in numerous scenarios. However, current LLM systems are vulnerable to prompt-based attacks: jailbreaking attacks coerce the LLM system into generating harmful content, while hijacking attacks manipulate the LLM system into performing attacker-desired tasks, underscoring the need for detection tools. Unfortunately, existing detection approaches are usually tailored to specific attacks, resulting in poor generalization when detecting diverse attacks across different modalities. To address this, we propose JailGuard, a universal detection framework deployed on top of LLM systems for prompt-based attacks across text and image modalities. JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the divergence among the variants’ responses from the target model to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator combination policy to further improve detection generalization. Evaluation on a dataset containing 15 known attack types shows that JailGuard achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%.
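To make the abstract's core idea concrete, the following is a minimal, hedged sketch of divergence-based detection: mutate an untrusted prompt, query the target model on each variant, and flag the input as an attack when the responses diverge too much. The `query_llm` stub, the two toy mutators, and the KL-divergence threshold are illustrative assumptions, not the paper's actual 18 mutators, combination policy, or metric.

```python
# Illustrative sketch only; not the authors' JailGuard implementation.
import math
import random
from collections import Counter


def query_llm(prompt: str) -> str:
    """Placeholder for a call to the target LLM system (hypothetical)."""
    raise NotImplementedError("Plug in your model/API client here.")


def drop_random_word(prompt: str, rng: random.Random) -> str:
    """Toy text mutator: remove one randomly chosen word."""
    words = prompt.split()
    if len(words) > 1:
        words.pop(rng.randrange(len(words)))
    return " ".join(words)


def swap_adjacent_words(prompt: str, rng: random.Random) -> str:
    """Toy text mutator: swap one random pair of adjacent words."""
    words = prompt.split()
    if len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def kl_divergence(p: Counter, q: Counter, eps: float = 1e-9) -> float:
    """KL(p || q) over smoothed word-frequency distributions of two responses."""
    vocab = set(p) | set(q)
    p_total = sum(p.values()) + eps * len(vocab)
    q_total = sum(q.values()) + eps * len(vocab)
    return sum(
        ((p[w] + eps) / p_total)
        * math.log(((p[w] + eps) / p_total) / ((q[w] + eps) / q_total))
        for w in vocab
    )


def detect_attack(prompt: str, n_variants: int = 4,
                  threshold: float = 1.0, seed: int = 0) -> bool:
    """Return True if responses to mutated variants diverge beyond `threshold`."""
    rng = random.Random(seed)
    mutators = [drop_random_word, swap_adjacent_words]
    variants = [rng.choice(mutators)(prompt, rng) for _ in range(n_variants)]
    responses = [query_llm(v) for v in variants]
    dists = [Counter(r.lower().split()) for r in responses]
    # Maximum pairwise divergence across variant responses; attack prompts
    # are expected to produce more divergent responses under perturbation.
    max_div = max(
        kl_divergence(dists[i], dists[j])
        for i in range(len(dists))
        for j in range(len(dists))
        if i != j
    )
    return max_div > threshold
```

In practice, the mutator set, the number of variants, and the divergence threshold would be tuned per modality; the sketch above only illustrates the robustness-gap intuition stated in the abstract.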
URI: https://hdl.handle.net/10356/184567
ISSN: 1049-331X
DOI: 10.1145/3724393
Schools: College of Computing and Data Science 
Rights: © 2025 Copyright held by the owner/author(s). All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1145/3724393.
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:CCDS Journal Articles

Files in This Item:
File: 3724393.pdf (1.85 MB, Adobe PDF) — View/Open

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.