Please use this identifier to cite or link to this item:
Title: Effective action recognition with fully supervised and self-supervised methods
Authors: Cao, Haozhi
Keywords: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Issue Date: 2021
Publisher: Nanyang Technological University
Source: Cao, H. (2021). Effective action recognition with fully supervised and self-supervised methods. Master's thesis, Nanyang Technological University, Singapore.
Abstract: Action recognition in videos has attracted interest from the computer vision and machine learning communities owing to applications such as surveillance and smart homes. In addition to the spatial information in individual frames, videos contain temporal information across frames; effective spatio-temporal representation is therefore the key to accurate action recognition in videos. Previous works have proposed various fully-supervised and self-supervised methods for video representation learning. Most fully-supervised methods use convolutional neural networks (CNNs) to extract spatial representations, while temporal representations are usually modelled by pixel-wise correlations. However, extracting correlations between all pixels is inefficient, since some pixels belong to non-salient areas (e.g., backgrounds or environments). Self-supervised methods, on the other hand, are proposed to leverage the more accessible unlabeled data on the Internet and transfer the extracted representations to different downstream tasks. The core of self-supervised methods is to design a pretext task whose supervision signal is generated automatically from characteristics of the unlabeled data. Although self-supervised methods avoid manual annotation, their performance still lags behind that of fully-supervised methods. In this thesis, we address the above research gaps with two novel deep learning methods, advancing fully-supervised and self-supervised learning respectively. For fully-supervised learning, we propose a novel Key Point Shift Embedding Module (KPSEM) that adaptively extracts channel-wise key point shifts across video frames, without key point annotation, for temporal feature extraction. Key points are adaptively extracted as the feature points with maximum feature values within split regions, while key point shifts are the spatial displacements of corresponding key points across frames.
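The thesis text itself contains no code; the following is only an illustrative sketch (not the author's implementation) of the key-point extraction just described. It assumes per-frame feature maps stored as a NumPy array of shape (T, C, H, W) and a hypothetical `grid` parameter controlling how each map is split into regions; the key point of a region is the location of its maximum feature value, and the shift is its displacement between consecutive frames.

```python
import numpy as np

def keypoint_shifts(frames, grid=2):
    """Channel-wise key point shifts between consecutive feature maps.

    frames: array of shape (T, C, H, W) -- per-frame feature maps.
    Each H x W map is split into grid x grid regions; the key point of
    a region is the (y, x) location of its maximum feature value.
    Returns shifts of shape (T-1, C, grid*grid, 2): the displacement of
    each region's key point from frame t to frame t+1.
    """
    T, C, H, W = frames.shape
    rh, rw = H // grid, W // grid
    pts = np.zeros((T, C, grid * grid, 2))
    for t in range(T):
        for c in range(C):
            for i in range(grid):
                for j in range(grid):
                    region = frames[t, c, i*rh:(i+1)*rh, j*rw:(j+1)*rw]
                    y, x = np.unravel_index(np.argmax(region), region.shape)
                    # store absolute coordinates within the full map
                    pts[t, c, i * grid + j] = (i * rh + y, j * rw + x)
    return pts[1:] - pts[:-1]
```

In the thesis these shifts are then fed to linear embedding layers; the loops above are written for clarity, not speed.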
The key point shifts are encoded into the overall temporal features via linear embedding layers in a multi-set manner. To advance self-supervised learning, we propose a novel self-supervised method, called Video Incoherence Detection (VID), that leverages incoherence detection for spatio-temporal feature extraction. It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos. Specifically, each training sample, denoted as an incoherent clip, is constructed from multiple sub-clips hierarchically sampled from the same raw video, with various lengths of incoherence between them. The network learns high-level representations by predicting the relative location and length of the incoherence given the incoherent clip as input. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between different incoherent clips from the same raw video. Our experiments show that KPSEM and VID achieve state-of-the-art action recognition performance under fully-supervised and self-supervised learning, respectively. Thorough ablation studies are also conducted to validate both proposed methods.
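As a minimal sketch of the incoherent-clip construction described above (the thesis samples multiple sub-clips hierarchically with varying lengths; this simplification uses two fixed-length sub-clips and one gap, and the function name is hypothetical):

```python
import random

def make_incoherent_clip(video_len, sub_len=8, max_gap=4, rng=None):
    """Build one incoherent clip: two sub-clips sampled from the same
    raw video, separated by a skipped run of frames (the incoherence).
    Returns the concatenated frame indices plus the (location, length)
    pair that serves as the self-supervised prediction target.
    """
    rng = rng or random.Random(0)
    gap = rng.randint(1, max_gap)                    # length of incoherence
    start = rng.randint(0, video_len - 2 * sub_len - gap)
    first = list(range(start, start + sub_len))
    second = list(range(start + sub_len + gap, start + 2 * sub_len + gap))
    location = sub_len  # incoherence sits right after the first sub-clip
    return first + second, location, gap
```

A network trained on such clips would regress or classify `location` and `gap` from the assembled frames, which forces it to model temporal continuity rather than single-frame appearance.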
DOI: 10.32657/10356/152741
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: embargo_20231231
Fulltext Availability: With Fulltext
Appears in Collections:EEE Theses

Files in This Item:
File (Adobe PDF, 4.91 MB): under embargo until Dec 31, 2023

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.