Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/179521
Title: Leveraging deep learning for visual understanding of videos
Authors: Tan, Clement Xian Ren
Keywords: Computer and Information Science
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Tan, C. X. R. (2024). Leveraging deep learning for visual understanding of videos. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/179521
Abstract: In the field of Computer Vision (CV), the pursuit of human-level recognition and reasoning of visual scenes has been a long-standing aspiration. Over the past decade, deep learning has contributed significantly to the progress of CV, facilitated by the availability of big data and increased computational power. In CV, visual understanding encompasses not only the fundamental aspect of recognition, which involves identifying and categorizing objects and patterns within visual data, but also reasoning, which involves higher-level cognitive processes such as inferring relationships, predicting outcomes, and drawing meaningful conclusions from the observed visual information. Machines have achieved human-like recognition capabilities across a variety of visual recognition tasks, such as image classification, object detection, semantic segmentation, and instance segmentation. The huge success of deep learning on visual recognition tasks has prompted researchers to tackle the more challenging problem of visual reasoning. Recognizing and reasoning about objects within a scene is of paramount importance to applications such as human-robot collaboration and autonomous vehicles. This thesis aims to address the issues associated with current deep learning models for visual understanding and focuses on three tasks: (1) video object segmentation, (2) abductive action inference, and (3) action-conditioned scene graph prediction.

Firstly, video object segmentation is an object tracking task that traditionally relies on supervised learning and therefore requires extensive annotated datasets for model training; manually labeling such vast datasets is repetitive and impractical. In addition, tracking objects within videos poses a distinct challenge because of appearance changes and occlusion across frames. The work proposed in this thesis explores the potential of self-supervised learning for video object segmentation, aiming to harness freely available internet data, such as YouTube videos, for model training. Unlike existing self-supervised approaches that model pixel-to-pixel correspondence, this research shifts the focus to superpixel-to-superpixel correspondence. To achieve this, a novel approach is proposed that tracks superpixels between video frames through an attention mechanism trained within an end-to-end self-supervised framework. The approach aims to improve on existing self-supervised video object segmentation models, bring the benefits of superpixels to self-supervised learning, and mitigate the limitations of supervised learning on large-scale annotated datasets.
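The abstract describes the attention-based superpixel correspondence only at a high level. As a rough illustration, the sketch below pools dense frame features over superpixel regions and uses cross-frame attention to propagate segmentation labels from one frame to the next; the function names, feature dimensions, temperature value, and soft label-propagation step are assumptions made for illustration, not the architecture actually proposed in the thesis.

import torch
import torch.nn.functional as F

def superpixel_features(frame_feats, superpixel_ids, num_superpixels):
    # Average-pool dense frame features (C, H, W) over each superpixel region.
    c = frame_feats.size(0)
    flat_feats = frame_feats.view(c, -1).t()                    # (H*W, C)
    flat_ids = superpixel_ids.view(-1)                          # (H*W,)
    pooled = torch.zeros(num_superpixels, c)
    pooled.index_add_(0, flat_ids, flat_feats)
    counts = torch.bincount(flat_ids, minlength=num_superpixels).clamp(min=1)
    return pooled / counts.unsqueeze(1)                         # (S, C)

def propagate_labels(feats_t, labels_t, feats_t1, temperature=0.07):
    # Soft superpixel-to-superpixel correspondence: each superpixel in frame t+1
    # attends to superpixels in frame t and inherits a weighted mix of their labels.
    sim = F.normalize(feats_t1, dim=1) @ F.normalize(feats_t, dim=1).t()   # (S1, S0)
    attn = F.softmax(sim / temperature, dim=1)
    return attn @ labels_t                                      # (S1, num_classes)

# Usage with dummy data: two frames, 64-channel features, 100 superpixels each.
C, H, W, S = 64, 32, 32, 100
feat0, feat1 = torch.randn(C, H, W), torch.randn(C, H, W)
sp0 = torch.randint(0, S, (H, W))     # superpixel id map for frame t (e.g. from SLIC)
sp1 = torch.randint(0, S, (H, W))     # superpixel id map for frame t+1
labels0 = torch.rand(S, 2)            # per-superpixel soft labels (object vs. background)
f0 = superpixel_features(feat0, sp0, S)
f1 = superpixel_features(feat1, sp1, S)
labels1 = propagate_labels(f0, labels0, f1)   # predicted soft labels for frame t+1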
Secondly, this thesis introduces the complex task of abductive action inference to assess the abductive reasoning abilities of deep learning models in comprehending visual scenes. A set of novel object-relational models designed specifically for this task is presented and compared with existing state-of-the-art image, video, and vision-language models. These models are tasked with inferring, through abduction, the human-performed action that led to the scene depicted in an image or snapshot. The experiments demonstrate the potential of deep learning models for abductive action inference. This research is vital to advancing our understanding of the reasoning capacities of deep learning models and represents a significant stride toward human-level reasoning in visual comprehension.

Lastly, action-conditioned scene graph prediction involves predicting the scene graph relations of a future state given an initial state and an action. Instead of predicting visual representations of future states, each state is organized as a scene graph of human-object relationship triplets, succinctly encapsulating the dynamics within the scene. Notably, this task remains underdeveloped in visual understanding research, with prior video scene graph generation approaches largely overlooking actions as a crucial signal. Actions applied to a precondition state generate an effect state that embodies their anticipated outcomes, and human cognitive reasoning allows the consequences of an action to be inferred from the initial scene context. To address this challenge, this thesis introduces the Action-conditioned Scene Graph dataset and proposes the Action-conditioned Effect Relational Transformer (AERT), a model designed to capture scene relations and action context in order to predict future scene graph relations effectively.
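For the abductive action inference task above, the abstract mentions object-relational models but does not describe them. The following is a minimal, assumed sketch of one such model: it encodes every ordered pair of detected objects as a candidate relation, pools the relations into a scene summary, and classifies that summary into the action that most plausibly explains the observed scene. The class name, layer sizes, and mean-pooling design are illustrative assumptions, not the models evaluated in the thesis.

import torch
import torch.nn as nn

class ObjectRelationalAbducer(nn.Module):
    # Illustrative sketch: abduce the action that produced a scene from
    # detected object features (not the thesis's actual models).
    def __init__(self, obj_dim=256, hidden_dim=512, num_actions=50):
        super().__init__()
        # Encode every ordered pair of objects as a candidate relation.
        self.relation_mlp = nn.Sequential(
            nn.Linear(2 * obj_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Classify the pooled relational summary into an action label.
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obj_feats):
        # obj_feats: (num_objects, obj_dim) features of objects detected in the snapshot.
        n = obj_feats.size(0)
        idx_i, idx_j = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
        mask = idx_i != idx_j                                   # all ordered pairs (i, j), i != j
        pairs = torch.cat([obj_feats[idx_i[mask]], obj_feats[idx_j[mask]]], dim=-1)
        relations = self.relation_mlp(pairs)                    # (n*(n-1), hidden_dim)
        scene_summary = relations.mean(dim=0)                   # pool relations into one vector
        return self.action_head(scene_summary)                  # logits over candidate actions

# Usage: score which action most plausibly explains the observed scene.
model = ObjectRelationalAbducer()
objects = torch.randn(5, 256)                 # e.g. five detected objects
predicted_action = model(objects).argmax().item()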
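The AERT model is likewise only named in the abstract. The sketch below shows one plausible way to condition a transformer encoder on an action so that it predicts effect-state relations from a precondition scene graph: each precondition triplet becomes a token, the action is prepended as an extra token, and a classification head predicts the relation label of each human-object pair in the effect state. Every architectural detail here (embedding sizes, number of layers, the per-triplet output head) is an assumption rather than the published AERT design.

import torch
import torch.nn as nn

class ActionConditionedRelationPredictor(nn.Module):
    # Illustrative sketch (not the published AERT): predict effect-state relation
    # labels for each human-object pair, given the precondition scene graph
    # triplets and the action being performed.
    def __init__(self, num_entities=100, num_relations=30, num_actions=50, dim=256):
        super().__init__()
        self.ent_emb = nn.Embedding(num_entities, dim)
        self.rel_emb = nn.Embedding(num_relations, dim)
        self.act_emb = nn.Embedding(num_actions, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.effect_head = nn.Linear(dim, num_relations)   # relation label per triplet slot

    def forward(self, subj_ids, rel_ids, obj_ids, action_id):
        # Each precondition triplet (human, relation, object) becomes one token.
        tokens = self.ent_emb(subj_ids) + self.rel_emb(rel_ids) + self.ent_emb(obj_ids)
        # Prepend the action as an extra token so attention can mix action
        # context into every triplet representation.
        action_token = self.act_emb(action_id).unsqueeze(1)        # (B, 1, dim)
        encoded = self.encoder(torch.cat([action_token, tokens], dim=1))
        # Predict the effect-state relation for each human-object pair.
        return self.effect_head(encoded[:, 1:, :])                 # (B, T, num_relations)

# Usage: one precondition graph with three triplets plus an action id.
model = ActionConditionedRelationPredictor()
subj = torch.tensor([[0, 0, 0]])      # the person is the subject of each triplet
rel = torch.tensor([[4, 7, 2]])       # precondition relation ids (e.g. "holding", "next to")
obj = torch.tensor([[12, 31, 8]])     # object ids (e.g. "door", "cup")
act = torch.tensor([3])               # action id (e.g. "open")
effect_logits = model(subj, rel, obj, act)    # (1, 3, num_relations)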
URI: https://hdl.handle.net/10356/179521
DOI: 10.32657/10356/179521
Schools: College of Computing and Data Science 
Organisations: A*STAR
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: embargo_20260807
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Theses

Files in This Item:
File: Clement_PhD_Thesis_final_revised_final.pdf
Size: 32.07 MB
Format: Adobe PDF
Availability: Under embargo until Aug 07, 2026
