Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/184058
Title: | Machine learning prediction of antibody heavy and light chain pairs using cross-attention |
Authors: | Ng, Jovin Zu Wei |
Keywords: | Computer and Information Science; Medicine, Health and Life Sciences |
Issue Date: | 2025 |
Publisher: | Nanyang Technological University |
Source: | Ng, J. Z. W. (2025). Machine learning prediction of antibody heavy and light chain pairs using cross-attention. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/184058 |
Project: | CCDS24-0531 |
Abstract: | This work addresses the challenge of predicting compatible heavy-light chain pairs in antibodies using a state-of-the-art computational approach. This study implements Word2Vec and FastText embedding models to generate vector representations of antibody sequences, combined with a cross-attention transformer to predict antibody chain pairs. Utilising convolutional embedding modules for feature extraction, multi-head attention layers for capturing chain interactions, and a feature integration network for final pair classification, our work aims to improve our ability to predict antibody pairing, potentially contributing to therapeutic antibody design and vaccine development. Our model leverages a dataset of 1.3 million natively paired antibody sequences both to generate negative samples and to train a data-hungry deep neural network. To ensure sequence diversity, a modified CD-HIT clustering algorithm, which limited maximum cluster size at a 90% sequence identity threshold, was employed as a final pre-processing step. We analysed several techniques for generating negative samples (variational autoencoders, motif-based generation, random shuffling and dissimilarity-based negative sampling) and compared Word2Vec with FastText to determine their impact on model performance within a biological domain; an appropriate combination of techniques was then used to evaluate the model on a test dataset. Overall, the final model achieved an accuracy of 0.97, an F1-score of 0.964, and an AUC of 0.974 on the unseen test set.
We demonstrated that FastText embeddings performed comparably to Word2Vec while offering better handling of out-of-vocabulary (OOV) sequences, and that dissimilarity-based negative sampling would likely produce higher quality training data than random shuffling. Our findings highlight the potential of cross-attention mechanisms for predicting antibody pairing, which could contribute to downstream tasks such as antibody engineering. A simple web application was also developed as a proof-of-concept to make computational tools for biological tasks easier to use and more accessible. |
URI: | https://hdl.handle.net/10356/184058 |
Schools: | College of Computing and Data Science |
Research Centres: | Biomedical Informatics Lab |
Fulltext Permission: | restricted |
Fulltext Availability: | With Fulltext |
Appears in Collections: | CCDS Student Reports (FYP/IA/PA/PI) |
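The abstract names random shuffling as one of the negative-sampling techniques compared. The following is a minimal sketch of that general idea, not the report's implementation (the full text is restricted): the function name `random_negative_pairs`, the `max_tries` parameter, and the toy sequence strings are all hypothetical, chosen only for illustration.

```python
import random

def random_negative_pairs(native_pairs, seed=0, max_tries=20):
    """Pair each heavy chain with a randomly drawn light chain.

    Hypothetical helper: a draw is kept only if the resulting
    (heavy, light) pair is not one of the native pairs.
    """
    rng = random.Random(seed)
    native = set(native_pairs)
    lights = [light for _, light in native_pairs]
    negatives = []
    for heavy, _ in native_pairs:
        # Bounded retries so the loop always terminates.
        for _ in range(max_tries):
            light = rng.choice(lights)
            if (heavy, light) not in native:
                negatives.append((heavy, light))
                break
    return negatives

# Toy placeholder strings, not real antibody sequences.
toy_pairs = [("HEAVY_A", "LIGHT_A"),
             ("HEAVY_B", "LIGHT_B"),
             ("HEAVY_C", "LIGHT_C")]
negatives = random_negative_pairs(toy_pairs)
print(negatives)
```

Shuffled pairs of this kind may still resemble native pairs closely, which is consistent with the abstract's remark that dissimilarity-based sampling would likely yield higher quality training data.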
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Ng Zu Wei, Jovin_FYP Report_Amended.pdf Restricted Access | FYP Report | 2.44 MB | Adobe PDF | View/Open |
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.