Please use this identifier to cite or link to this item:
|Towards temporal sentence grounding in videos
|Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Computer applications
Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
|Nanyang Technological University
|Zhang, H. (2022). Towards temporal sentence grounding in videos. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/163788
|Temporal sentence grounding in videos (TSGV), also known as natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve from an untrimmed video a temporal moment (i.e., a segment of the video) that semantically corresponds to a language query. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. Successful retrieval of temporal moments enables machines to understand and organize multimodal information in a systematic manner. Unlike humans, who can quickly identify temporal moments that are semantically related to a given language query by using their inference-making ability and commonsense knowledge, machines do not have such intelligence. The main challenge is that machines need to understand the semantics of both the video and the language query, and to perform precise cross-modal reasoning between them. As video and language query are different modalities, the recognition and localization of temporal moments depend greatly on how well a machine understands the input contents and the interactions between them. In this thesis, we introduce several novel approaches that tackle the TSGV problem from new perspectives. First, we propose to formulate TSGV as a span-based question answering (QA) task by treating the input video as a text passage. We then devise a video span localizing network (VSLNet), built on top of a typical span-based QA framework, that addresses TSGV by accounting for the differences between TSGV and span-based QA. The proposed method demonstrates that adopting a span-based QA framework is a promising direction for solving TSGV, and it achieves superior performance on several benchmark datasets. Second, despite the promising performance achieved by VSLNet, we observe that existing solutions, including VSLNet, perform well only on short videos and fail to generalize to long videos.
To address the issue of performance degradation on long videos, we extend VSLNet to VSLNet-L by applying a multi-scale split-and-concatenation strategy. VSLNet-L splits the untrimmed video into short clip segments, predicts which clip segment contains the target moment, and suppresses the importance of the other segments. Experimental results show that VSLNet-L effectively addresses the performance degradation on long videos. Third, when the evaluation metric becomes strict, the results of TSGV methods drop significantly; that is, the predicted moment boundaries do not fit the ground truth well. Building on VSLNet, we investigate a sequence matching approach that incorporates concepts from named entity recognition (NER) to remedy moment boundary prediction errors. We first analyze the relationship between TSGV and NER and reveal that moment boundary prediction in TSGV is a generalized entity boundary detection problem. This insight leads us to equip the model with an NER-style boundary detection module and to develop a more effective and efficient TSGV algorithm. Fourth, we analyze the annotation distributional bias in widely used TSGV datasets. The existence of such bias “hints” a model toward capturing the statistical regularities of moment annotations. To address this issue, we propose two debiasing strategies, data debiasing and model debiasing, on top of VSLNet to “force” a TSGV model to focus on cross-modal reasoning for precise moment retrieval. Experimental results show that both strategies are effective in improving model generalization capability and suppressing the effects of bias. Finally, we study the video corpus moment retrieval (VCMR) task, which aims to retrieve a temporal moment from a collection of untrimmed and unsegmented videos. VCMR is an extension of the TSGV task, but it is more practical because it does not rely on the strict assumption that a video-query pair is given.
For this task, we first study the characteristics of two general frameworks for VCMR: one offers high efficiency but inferior retrieval performance, while the other offers better performance but low efficiency. We then propose a retrieval and localization network with contrastive learning to reconcile the conflict between the efficiency and accuracy of existing approaches. All in all, although TSGV has been established and investigated for years, this thesis contributes several key ideas for solving TSGV from different perspectives, namely from the view of span-based QA and NER in NLP. In addition, we address the annotation distributional bias of TSGV and extend the task to a more practical scenario. We also shed light on a few potential directions for future work.
|School of Computer Science and Engineering
|This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
|Appears in Collections:
Updated on Feb 19, 2024
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.