Title: Phishing email detection using machine learning
Authors: Goh, Ying Ting
Keywords: Engineering::Computer science and engineering
Issue Date: 2021
Publisher: Nanyang Technological University
Source: Goh, Y. T. (2021). Phishing email detection using machine learning. Final Year Project (FYP), Nanyang Technological University, Singapore.
Abstract: Phishing is a form of Internet fraud that uses socially engineered email messages to deceive users into disclosing sensitive information or clicking on malicious links. This can lead to identity theft, data breaches and financial losses for the victims. As such, it is important for individuals and organisations, particularly in the government sector, to safeguard against such threats in a timely manner. This project seeks to use transformers, which have achieved state-of-the-art performance in various natural language processing tasks, to automate the phishing email classification process. To do so, email textual data, email headers and Uniform Resource Locators (URLs) found in emails were extracted from an internal dataset. DistilBERT, DistilRoBERTa and XLNet models were then trained on the email textual data, while traditional machine learning models such as decision trees and random forests were trained on features extracted from the email headers and email URLs. These models were then ensembled over a logistic regression layer. The DistilBERT, DistilRoBERTa and XLNet models achieved promising results in phishing email classification, mostly achieving Matthews Correlation Coefficient (MCC) scores of 85–87%. When ensembled over a logistic regression layer, these models performed even better, achieving MCC scores of 86–87%. Random forest models were found to perform the best in classifying the header and URL data extracted from the emails. When used to augment the transformer models, the random forest models trained on the URL data performed the best, improving the MCC performance by 3–6%. This shows that augmenting transformer models with random forest models trained on URL data is a promising approach to phishing email classification. All in all, transformers can achieve good results when trained on email textual data to perform phishing email classification. When augmented with URL data, these models perform even better, making this a viable approach to automating the phishing email classification process.
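The abstract reports its results as Matthews Correlation Coefficient (MCC) scores. As a reading aid, the sketch below computes MCC from binary confusion-matrix counts using the standard definition. It is not taken from the report: the function names and the label lists are hypothetical and purely illustrative, with 1 assumed to mean "phishing" and 0 "legitimate".

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical labels, not data from the report.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

score = mcc(*confusion_counts(y_true, y_pred))
print(f"MCC = {score:.4f}")
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), so a reported score of 85% corresponds to 0.85 on this scale. Because it uses all four confusion-matrix cells, it remains informative when the two classes are imbalanced.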
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: Goh Ying Ting - FYP Final Report.pdf
Description: Restricted Access
Size: 1.64 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.