Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/150258
Title: Data-driven and NLP for long document learning representation
Authors: Ko, Seoyoon
Keywords: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Electrical and electronic engineering
Issue Date: 2021
Publisher: Nanyang Technological University
Source: Ko, S. (2021). Data-driven and NLP for long document learning representation. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/150258
Project: A3047-201
Abstract: Natural language processing (NLP) has been advancing at an incredible pace. However, long-document representation has not been deeply explored despite its importance. Semantic matching for long documents has many applications, including citation and article recommendation, and in today's increasingly data-driven world these applications are becoming an integral part of society. However, because of the length of long documents, capturing their semantics remains a challenge. Recently, the Siamese multi-depth attention-based hierarchical (SMASH) model was proposed, which uses document structure to capture the semantics of a long document. In this project, a novel Siamese Hierarchical Weight Sharing Transformer (SHWEST) and a Siamese Hierarchical Transformer (SHT) are proposed based on SMASH. These models aim to improve long-document representations using the state-of-the-art transformer encoder architecture. Three hierarchical document representations are explored in this report: paragraph and sentence level (P+S), paragraph level (P), and sentence level (S). The report aims to determine how effectively each hierarchical document representation captures the semantics of a long document for both SHT and SHWEST. Experimental studies were conducted to compare SHWEST and SHT against SMASH and RNN baselines on the AAN benchmark dataset. The experiments showed that the SHT and SHWEST models outperform all the baseline models, including SMASH, for all three representations, and are more efficient, requiring less time for all three combinations. In general, the P+S and S representations perform better than the P representation. In particular, SHWEST (S) achieves 13.87% higher accuracy than the RNN model, while SHWEST (P+S) achieves 12.89% higher accuracy. Moreover, SHWEST outperforms SHT in all aspects.
SHT performs only slightly better than SMASH, but it is considerably more efficient, with P+S being at least 2.8 times faster. Furthermore, both SHWEST and SHT have the potential to be further optimized when more computing resources are available.
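The hierarchical, weight-sharing structure described in the abstract can be illustrated with a minimal sketch. Everything below is a toy stand-in, not the thesis's implementation: the linear-map-plus-mean-pooling `encode_block` substitutes for a transformer encoder layer, and all names, weights, and dimensions are invented for illustration. The sketch shows the P+S idea — sentences are pooled into sentence vectors, sentence vectors into paragraph vectors, and paragraph vectors into one document embedding — with the same encoder weights applied to both documents of a Siamese pair.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 16  # toy embedding width; invented for this sketch

def encode_block(vecs, W):
    # Stand-in for a transformer encoder: linear map + nonlinearity + mean pooling.
    return np.tanh(vecs @ W).mean(axis=0)

def encode_document(doc, W_sent, W_para):
    # doc: list of paragraphs; each paragraph is a list of (tokens x EMB) matrices.
    para_vecs = []
    for paragraph in doc:
        # Sentence level: pool each sentence's token vectors into one vector.
        sent_vecs = np.stack([encode_block(s, W_sent) for s in paragraph])
        # Paragraph level: pool the sentence vectors into one paragraph vector.
        para_vecs.append(encode_block(sent_vecs, W_para))
    # Document level: pool paragraph vectors into a single document embedding.
    return np.stack(para_vecs).mean(axis=0)

def siamese_similarity(doc_a, doc_b, W_sent, W_para):
    # Weight sharing: the SAME weights encode both sides of the Siamese pair.
    ea = encode_document(doc_a, W_sent, W_para)
    eb = encode_document(doc_b, W_sent, W_para)
    return float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb)))

W_sent = rng.normal(size=(EMB, EMB))
W_para = rng.normal(size=(EMB, EMB))
doc = [[rng.normal(size=(5, EMB)) for _ in range(2)] for _ in range(3)]
print(siamese_similarity(doc, doc, W_sent, W_para))  # identical documents score 1.0
```

In the real models the pooling stand-ins would be replaced by transformer encoder layers (and, for SHT, the sentence-level and paragraph-level encoders would hold separate weights rather than shared ones), but the hierarchy and the Siamese comparison are as sketched.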
URI: https://hdl.handle.net/10356/150258
Schools: School of Electrical and Electronic Engineering 
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: EEE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: Ko_Seoyoon-FYP.pdf
Description: Restricted Access
Size: 2.2 MB
Format: Adobe PDF

Page view(s): 334 (updated on May 7, 2025)
Download(s): 51 (updated on May 7, 2025)


Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.