Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/163261
Full metadata record
DC Field | Value | Language
dc.contributor.author | Xu, Liwen | en_US
dc.date.accessioned | 2022-11-30T02:23:42Z | -
dc.date.available | 2022-11-30T02:23:42Z | -
dc.date.issued | 2022 | -
dc.identifier.citation | Xu, L. (2022). Deep metric based feature engineering to Improve document-level representation for document clustering. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/163261 | en_US
dc.identifier.uri | https://hdl.handle.net/10356/163261 | -
dc.description.abstract | Document-level representation is attracting growing research attention. Recent Transformer-based pretrained language models (PLMs) such as BERT learn powerful textual representations. These models were originally designed for word-level tasks, which limits their maximum input length. Current document-level approaches work around this limitation in various ways. Some use only the concatenation of the title and the abstract as the input to the PLM, which neglects the rich semantic information in the main body of the document. Others obtain document-level representations by encoding multiple sentences in a document and concatenating them directly. However, the resulting representation may be highly redundant, and the training and inference processes are computationally heavy for real-world applications. To alleviate these two drawbacks, we decompose the path from word level to document level into a two-stage feature engineering process. In the first stage, a PLM extracts a sentence-level representation of each sentence in a document from its word-level tokens, and these representations are concatenated into a document matrix. In the second stage, the document matrices, which carry the semantic information of all the text in the documents, are fed into a CNN model to obtain document-level representations with the dimension reduced by a factor of 24. The model is optimized with a deep metric representation learning objective. Extensive experiments are conducted for hyper-parameter tuning and model design, and to compare different deep metric representation learning objectives. | en_US
dc.language.iso | en | en_US
dc.publisher | Nanyang Technological University | en_US
dc.subject | Engineering::Computer science and engineering::Computing methodologies::Document and text processing | en_US
dc.title | Deep metric based feature engineering to Improve document-level representation for document clustering | en_US
dc.type | Thesis-Master by Coursework | en_US
dc.contributor.supervisor | Lihui Chen | en_US
dc.contributor.school | School of Electrical and Electronic Engineering | en_US
dc.description.degree | Master of Science (Signal Processing) | en_US
dc.contributor.supervisoremail | ELHCHEN@ntu.edu.sg | en_US
item.fulltext | With Fulltext | -
item.grantfulltext | restricted | -
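
The two-stage pipeline described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: random vectors stand in for the PLM sentence embeddings, the toy convolution with max pooling stands in for the CNN, and a triplet loss is used as one common example of a deep metric learning objective. All function names, the padding and truncation rule, and the sizes are assumptions; the sizes are chosen only so that the flattened document matrix is 24 times larger than the output vector, echoing the figure in the abstract without claiming that this is how the thesis measures the reduction.

```python
import numpy as np

def build_document_matrix(sentence_vecs, max_sentences):
    """Stage 1 output: stack sentence vectors into a (max_sentences, dim) matrix.

    Short documents are zero-padded and long ones truncated (an assumption)."""
    sentence_vecs = np.asarray(sentence_vecs, dtype=np.float32)
    dim = sentence_vecs.shape[1]
    matrix = np.zeros((max_sentences, dim), dtype=np.float32)
    n = min(len(sentence_vecs), max_sentences)
    matrix[:n] = sentence_vecs[:n]
    return matrix

def conv_encode(doc_matrix, kernels):
    """Stage 2 (toy CNN): slide each (width, dim) kernel over the sentences,
    apply ReLU and max-over-time pooling, giving one feature per kernel."""
    n_sent = doc_matrix.shape[0]
    feats = []
    for k in kernels:
        width = k.shape[0]
        responses = [np.sum(doc_matrix[i:i + width] * k)
                     for i in range(n_sent - width + 1)]
        feats.append(max(0.0, max(responses)))  # ReLU followed by max pooling
    return np.array(feats, dtype=np.float32)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on Euclidean distances: pull the positive closer than the negative."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical sizes: 8 sentences x 48 dims = 384 values in, 16 features out.
rng = np.random.default_rng(0)
dim, max_sentences, n_kernels = 48, 8, 16
kernels = [rng.standard_normal((3, dim)).astype(np.float32)
           for _ in range(n_kernels)]

def encode(n_sentences):
    # Random vectors stand in for PLM sentence embeddings.
    sents = rng.standard_normal((n_sentences, dim))
    return conv_encode(build_document_matrix(sents, max_sentences), kernels)

anchor, positive, negative = encode(5), encode(8), encode(12)
print(anchor.shape, triplet_loss(anchor, positive, negative))
```

In a real system the kernels would be trained by minimizing the metric loss over many triplets, and the resulting document vectors would then be passed to a clustering algorithm.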
Appears in Collections: EEE Theses
Files in This Item:
File | Description | Size | Format
Dissertation_Xu_Liwen.pdf (Restricted Access) | | 6.22 MB | Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.