Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/182636
Title: Identifying software vulnerabilities via code representation learning
Authors: Wu, Bozhi
Keywords: Computer and Information Science
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Wu, B. (2025). Identifying software vulnerabilities via code representation learning. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/182636
Abstract: In the era of big code, the proliferation of software applications has led to a corresponding increase in unidentified vulnerabilities, necessitating early detection during development to avoid potential disruptions and security risks post-deployment. Traditionally, static analysis tools, relying on manually defined rules, have been employed for early vulnerability detection but may struggle to adapt to new vulnerabilities. With the advancements in machine learning, learning-based approaches have emerged, leveraging automated code representation learning to identify vulnerabilities by learning from datasets. This thesis addresses several key challenges in vulnerability detection. Firstly, it discusses the need for comparative studies between learning-based approaches and static analysis tools in real-world scenarios to assess their respective strengths and limitations. Secondly, it highlights the challenge of developing precise code representations specifically tailored for vulnerability detection, emphasizing the limitations of current approaches in capturing nuanced vulnerability patterns across program semantics. Lastly, it explores the interpretability issues inherent in neural network-based approaches, which hinder understanding of why vulnerabilities are flagged. The thesis begins with an empirical study evaluating the effectiveness of existing methods and tools, providing insights into their practical application. Building upon this foundation, novel code representation techniques are introduced to improve the precision of identifying vulnerabilities in both commit and source code contexts. Specifically, E-SPI is introduced for security patch identification in commits, while SnapVuln enhances vulnerability detection in source code through dedicated slicing algorithms and a sophisticated GGNN with attention mechanisms. 
Additionally, the thesis introduces VulnSynth, an interpretable learning-based approach utilizing a domain-specific language based on LLVM IR and program synthesis techniques. VulnSynth synthesizes interpretable rules for vulnerability detection, overcoming the interpretability challenges of neural networks and providing clear explanations for vulnerability predictions. Extensive experiments demonstrate the efficacy of these advancements, showing superior performance of VulnSynth in detecting vulnerabilities compared to state-of-the-art approaches. In summary, this thesis contributes a comprehensive evaluation framework, novel code representation methodologies, and an interpretable approach for vulnerability detection, addressing critical challenges in the field and advancing the state of the art in software security.
URI: https://hdl.handle.net/10356/182636
DOI: 10.32657/10356/182636
Schools: College of Computing and Data Science 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Theses

Files in This Item:
File: thesis_wbz.pdf
Size: 3.22 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.