Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/151040
Full metadata record
DC Field | Value | Language
dc.contributor.author | He, Su | en_US
dc.date.accessioned | 2021-06-23T04:58:23Z | -
dc.date.available | 2021-06-23T04:58:23Z | -
dc.date.issued | 2021 | -
dc.identifier.citation | He, S. (2021). Language-guided visual retrieval. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/151040 | en_US
dc.identifier.uri | https://hdl.handle.net/10356/151040 | -
dc.description.abstract | Language-guided Visual Retrieval (LGVR) is an important direction in cross-modality learning. It aims to retrieve or localize target content in untrimmed visual data under the guidance of a linguistic description. In this thesis we study two popular sub-tasks of LGVR: Visual Grounding (VG), which aims to locate an object in an image, and Natural Language Video Localization (NLVL), which aims to locate a target clip within a long video. For VG, we propose a novel modular network that learns to match both the object's symbolic feature and its CNN-extracted visual feature with the linguistic information, achieving better cross-modality alignment. In addition, we introduce a residual attention parser to ease the difficulty of understanding language expressions. For NLVL, we utilize fine-grained semantic features of sparsely sampled video frames. To organize these discrete features, we propose a Hybrid Graph Network that captures both the spatial and the local temporal relationships between objects in the frames and also performs semantic matching between objects and words. To model long-span relationships between activities in the two modalities, we implement a temporal encoder based on an attention model. Finally, we formulate the prediction as a binary classification task rather than regressing the specific boundaries. We conduct extensive experiments on popular datasets for both tasks to validate the effectiveness of the proposed models. | en_US
dc.language.iso | en | en_US
dc.publisher | Nanyang Technological University | en_US
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). | en_US
dc.subject | Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision | en_US
dc.title | Language-guided visual retrieval | en_US
dc.type | Thesis-Master by Research | en_US
dc.contributor.supervisor | Lin Guosheng | en_US
dc.contributor.school | School of Computer Science and Engineering | en_US
dc.description.degree | Master of Engineering | en_US
dc.identifier.doi | 10.32657/10356/151040 | -
dc.contributor.supervisoremail | gslin@ntu.edu.sg | en_US
item.fulltext | With Fulltext | -
item.grantfulltext | open | -
Appears in Collections: SCSE Theses
Files in This Item:
File | Description | Size | Format
Thesis.pdf | | 2.67 MB | Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.