Human action recognition by embedding silhouettes and visual words.
Saghafi Khadem, Behrouz.
Date of Issue: 2013
School of Computer Engineering
Centre for Multimedia and Network Technology
With the availability of cheap video recording devices, fast internet access and huge storage spaces, the corpus of accessible video has grown tremendously over the last few years. Processing these videos for end-user tasks such as video retrieval, human-computer interaction (HCI) and biometrics requires automatic understanding of the video content. Human action recognition is one aspect of video understanding that is useful in surveillance, behavioral analysis and HCI. Although this problem has been studied for some years, challenges remain, including cluttered backgrounds, intra-class variance, inter-class similarity and occlusion. In this thesis, we propose three methods for action recognition. First, we propose a novel embedding for learning the manifold of human actions that is optimal with respect to the spatio-temporal correlation distance (SCD) between sequences. Sequences of actions can be compared based on distances between frames; however, comparison based on between-sequence distance is more efficient and effective. In particular, our proposed embedding minimizes the sum of distances between intra-class sequences while maximizing the sum of distances between inter-class points. Action sequences are represented by key postures chosen equidistantly from a semantic period of the action. The projected sequences are compared using SCD or the Hausdorff distance in a nearest-neighbor framework. The method not only outperforms other dimension reduction methods but is also comparable to the state of the art on three public datasets. Moreover, it is robust to a large extent to additive noise, occlusion, shape deformation and changes in viewpoint. Second, we propose an approach for introducing semantic relations into the bag-of-words framework for recognizing human actions. In the standard bag-of-words framework, features are clustered based on their appearance and not their semantic relations.
We exploit latent semantic models such as LSA and pLSA, as well as Canonical Correlation Analysis, to find a subspace in which visual words are more semantically distributed. We project the visual words into the computed space and apply k-means to obtain semantically meaningful clusters, which serve as a semantic visual vocabulary and lead to more discriminative histograms for recognizing actions. Our proposed method gives promising results on the challenging KTH action dataset. Finally, we introduce a novel method for combining information from multiple viewpoints. Spatio-temporal features are extracted from each viewpoint and used in a bag-of-words framework. Two codebooks of different sizes are used to form the histograms. The similarity between the computed histograms is captured by the histogram intersection kernel (HIK) as well as an RBF kernel with the Chi-Square distance. The resulting kernels are linearly combined using weights that are learned through an optimization process. For greater efficiency, a separate set of optimal weights is calculated for each binary SVM classifier. Our proposed method not only enables us to combine multiple views efficiently but also models the action in multiple spaces using the same features, thereby increasing performance. Several experiments are performed to show the efficiency of the framework as well as of its constituent parts. We obtain a state-of-the-art accuracy of 95.8% on the challenging IXMAS multi-view dataset.
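The kernel combination described for the multi-view method can be illustrated with a minimal sketch. It assumes L1-normalised bag-of-words histograms and uses hand-picked weights; in the thesis the weights are learned by optimization, separately for each binary SVM, and that step is not reproduced here. The function names, the `gamma` parameter and the toy data are hypothetical.

```python
import numpy as np

def hik_kernel(A, B):
    """Histogram intersection kernel: sum of element-wise minima."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def chi2_rbf_kernel(A, B, gamma=1.0, eps=1e-10):
    """RBF kernel built on the Chi-Square distance between histograms."""
    diff = A[:, None, :] - B[None, :, :]
    total = A[:, None, :] + B[None, :, :] + eps  # eps avoids division by zero
    chi2 = 0.5 * (diff ** 2 / total).sum(axis=2)
    return np.exp(-gamma * chi2)

def combine_kernels(kernels, weights):
    """Weighted linear combination of precomputed kernel matrices."""
    weights = np.asarray(weights, dtype=float)
    return sum(w * K for w, K in zip(weights, kernels))

# Toy L1-normalised histograms (assumed data: 3 samples, 4 visual words).
H = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.00, 0.00, 0.50, 0.50],
              [0.25, 0.25, 0.25, 0.25]])

# Weights are fixed by hand here purely for illustration.
K = combine_kernels([hik_kernel(H, H), chi2_rbf_kernel(H, H)], [0.6, 0.4])
```

A combined matrix such as `K` could then be passed to any SVM implementation that accepts a precomputed kernel.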
DRNTU::Engineering::Computer science and engineering::Computer applications::Computer-aided engineering