Title: Recognizing and predicting human actions with depth camera
Authors: Weng, Junwu
Keywords: Engineering::Electrical and electronic engineering
Issue Date: 2020
Publisher: Nanyang Technological University
Source: Weng, J. (2020). Recognizing and predicting human actions with depth camera. Doctoral thesis, Nanyang Technological University, Singapore.
Abstract: Understanding human behavior from videos is an important task in the computer vision community and a significant sub-branch of video analysis. Human behavior analysis is widely applied in scenarios such as human-computer interaction, video surveillance, video retrieval and autonomous driving. Thanks to the development of commodity depth cameras, skeleton-based human behavior analysis has recently drawn considerable attention in the computer vision community. Skeleton action sequences are extracted from depth cameras using pose estimation algorithms or captured directly by motion capture devices. Compared with RGB-based action sequences, skeleton-based action instances are simpler and more semantic. However, the limitation is that skeletal data provides less appearance information and little scene information. How to design suitable and stable models to understand human behavior, both body actions and hand gestures, using skeletal data is an interesting and challenging topic. To understand human behavior well through action sequences, two tasks are very important, namely action recognition and action prediction. In this thesis, four different models are proposed to handle these distinct tasks. Due to the success of deep learning models in image recognition, most state-of-the-art methods use deep learning as the tool for skeleton-based action recognition. However, compared with images and videos, which are composed of millions or billions of pixels, a skeleton is composed of only tens of joints and is thus of much lower complexity. For such lightweight data, non-parametric models like Naive-Bayes Nearest Neighbor (NBNN) may be more suitable than high-complexity deep learning models. In the first two works of this thesis, two robust NBNN-based models, ST-NBNN and ST-NBMIM, are proposed to characterize skeleton sequences.
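As a rough illustration of the non-parametric idea mentioned above, the following is a minimal sketch of the generic NBNN decision rule: each query descriptor is matched to its nearest descriptor in every class, and the class with the smallest summed distance wins. It is not the ST-NBNN or ST-NBMIM model from the thesis; the function names, the 2-D toy descriptors and the class labels are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nbnn_classify(query: List[Vector],
                  class_descriptors: Dict[str, List[Vector]]) -> str:
    """Return the class with the smallest summed descriptor-to-class distance."""
    def class_cost(label: str) -> float:
        pool = class_descriptors[label]
        # For each query descriptor, take its nearest neighbor within this class.
        return sum(min(sq_dist(d, ref) for ref in pool) for d in query)

    return min(class_descriptors, key=class_cost)


# Hypothetical toy descriptors; not data from the thesis.
train = {
    "wave": [(0.0, 1.0), (0.1, 0.9), (0.2, 1.1)],
    "kick": [(1.0, 0.0), (0.9, 0.1), (1.1, -0.1)],
}
query = [(0.05, 0.95), (0.15, 1.0)]
print(nbnn_classify(query, train))  # wave
```

No parameters are trained here: classification relies only on stored descriptors and nearest-neighbor distances, which is why such a rule can suit low-dimensional data like tens of joints.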
Besides, to better understand skeleton-based actions, bilinear classifiers are adopted to identify both the key temporal stages and the key spatial joints for action classification. Although only a linear classifier is used, experiments on five benchmark datasets show that, by combining the strengths of non-parametric and parametric models, ST-NBNN and ST-NBMIM achieve performance competitive with state-of-the-art results obtained with sophisticated models such as deep learning. Moreover, by identifying key skeleton joints and temporal stages for each action class, the two NBNN-based models can capture the essential spatio-temporal patterns that play key roles in recognizing actions, which is not always achievable with end-to-end models. When facing large-scale skeleton data, the non-parametric model reaches its limits, and deep-learning-based models demonstrate superior performance on large datasets. Meanwhile, human body movements exhibit spatial patterns among pose joints. It is thus of great importance to identify those motion patterns and avoid non-informative joints by identifying the key combinations of joints that matter for recognition. Although the discovery of key spatio-temporal patterns has been explored previously for skeleton-based action recognition, the temporal dynamics modeling of key joint combinations is not well researched in the community. In the third work of this thesis, a CNN model is proposed to adaptively search for key pose joints in each action sequence. The work uses deep learning to train a deformable CNN model that discovers sample-related key spatio-temporal patterns for action recognition. This deformable convolution makes better use of contextual joints for action and gesture recognition and is more robust to noisy joints.
The proposed model is evaluated on three benchmark datasets, and the experimental results show the effectiveness of introducing temporal dynamics modeling of key joint combinations into skeleton-based action recognition. The goal of early action recognition is to predict the action label when the sequence is only partially observed. Existing methods treat early action recognition as a sequential classification problem over different observation ratios of an action sequence. Since these models are trained by differentiating the positive category from all negative classes, the diverse information of different negative categories is ignored, which we believe can be collected to help improve recognition performance. In the last work of this thesis, a new direction, introducing category exclusion to early action recognition, is explored. Category exclusion is modeled as a mask operation on the classification probability output of a pre-trained early action recognition classifier. Specifically, policy-based reinforcement learning is used to train an agent. The agent generates a series of binary masks to exclude interfering negative categories during action execution and hence helps improve recognition accuracy. The proposed method is evaluated on three benchmark recognition datasets, and it consistently improves recognition accuracy across all observation ratios on the three datasets, with especially significant improvements at the early stages. In summary, this thesis demonstrates the superior performance of the four proposed methods, namely ST-NBNN, ST-NBMIM, Deformable Pose Traversal Convolution, and the Category Exclusion Agent, on the tasks of action recognition and action prediction for skeleton-based sequences. These four models are extensively evaluated on well-known benchmark datasets, and the experimental results show the effectiveness of these models on their corresponding tasks.
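The mask operation on the classifier's probability output, as described in the abstract, might look like the sketch below. It only illustrates applying a binary mask to a probability vector. The reinforcement-learning agent that produces the mask is not implemented, and the renormalization step, the fallback when every class is masked, and all names and numbers are assumptions rather than details taken from the thesis.

```python
from typing import List, Sequence


def apply_exclusion_mask(probs: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Zero out excluded categories (mask == 0) and renormalize the rest."""
    if len(probs) != len(mask):
        raise ValueError("probs and mask must have the same length")
    kept = [p * m for p, m in zip(probs, mask)]
    total = sum(kept)
    if total == 0.0:
        # Assumed fallback: if nothing survives the mask, keep the classifier output.
        return list(probs)
    return [p / total for p in kept]


def predict(probs: Sequence[float], mask: Sequence[int]) -> int:
    """Index of the most probable category after masking."""
    masked = apply_exclusion_mask(probs, mask)
    return max(range(len(masked)), key=masked.__getitem__)


# Hypothetical classifier output over four classes for a partially observed sequence.
probs = [0.40, 0.35, 0.15, 0.10]
# Hypothetical mask from the agent: class 0 is excluded as an interfering negative.
mask = [0, 1, 1, 1]

print(predict(probs, [1, 1, 1, 1]))  # 0
print(predict(probs, mask))          # 1
```

In this toy case the unmasked prediction is class 0, and excluding it shifts the prediction to class 1, which is the kind of correction the exclusion mask is meant to enable early in a sequence.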
DOI: 10.32657/10356/138384
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: EEE Theses

Files in This Item:
File: thesis-final.pdf
Description: PDF version of the thesis
Size: 10.58 MB
Format: Adobe PDF

Download(s) 20

Updated on Jan 29, 2023

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.