Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/150982
Title: Action-stage emphasized spatiotemporal VLAD for video action recognition
Authors: Tu, Zhigang
Li, Hongyan
Zhang, Dejun
Dauwels, Justin
Li, Baoxin
Yuan, Junsong
Keywords: Engineering::Electrical and electronic engineering
Issue Date: 2019
Source: Tu, Z., Li, H., Zhang, D., Dauwels, J., Li, B. & Yuan, J. (2019). Action-stage emphasized spatiotemporal VLAD for video action recognition. IEEE Transactions On Image Processing, 28(6), 2799-2812. https://dx.doi.org/10.1109/TIP.2018.2890749
Journal: IEEE Transactions on Image Processing 
Abstract: Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve comparably impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving the individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal Vector of Locally Aggregated Descriptors (ActionS-STVLAD) method to aggregate informative deep features across the entire video according to adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-STVLAD encoding approach, AVFS-ASFS selects the key frame features and automatically splits the corresponding deep features into segments, with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted key frame feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated using our similarity weighting. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks - HMDB51, UCF101, Kinetics, and ActivityNet - for evaluation. Results show that our method effectively pools useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.
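
To make the core encoding step concrete, the sketch below shows a minimal, generic weighted VLAD aggregation in Python/NumPy. This is an illustration under stated assumptions, not the authors' ActionS-STVLAD implementation: the codebook, descriptor shapes, and the per-feature weights (a hypothetical stand-in for the paper's similarity weighting of informative feature maps) are all assumptions made for the example.

    # Hedged sketch: a generic weighted VLAD encoder in NumPy. Not the
    # authors' ActionS-STVLAD; shapes, codebook, and weights are illustrative.
    import numpy as np

    def vlad_encode(features, codebook, weights=None):
        """Aggregate local descriptors into a VLAD vector.

        features : (N, D) local descriptors (e.g., per-frame CNN features)
        codebook : (K, D) cluster centers (e.g., from k-means)
        weights  : optional (N,) per-feature weights; a hypothetical stand-in
                   for the paper's similarity weighting of informative features
        """
        n, d = features.shape
        k = codebook.shape[0]
        if weights is None:
            weights = np.ones(n)

        # Hard-assign each descriptor to its nearest codebook center.
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)

        # Accumulate weighted residuals (feature minus center) per center.
        vlad = np.zeros((k, d))
        for i in range(n):
            c = assign[i]
            vlad[c] += weights[i] * (features[i] - codebook[c])

        # Standard VLAD post-processing: intra-normalize rows, flatten,
        # then L2-normalize the whole vector.
        norms = np.linalg.norm(vlad, axis=1, keepdims=True)
        vlad = vlad / np.maximum(norms, 1e-12)
        vlad = vlad.ravel()
        return vlad / max(np.linalg.norm(vlad), 1e-12)

    # Toy usage: 100 random 128-D descriptors, a 16-center codebook.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(100, 128))
    centers = rng.normal(size=(16, 128))
    w = rng.uniform(0.5, 1.0, size=100)  # illustrative per-feature weights
    print(vlad_encode(feats, centers, w).shape)  # (2048,)

Down-weighting or zeroing entries of `weights` plays the role that redundant-feature suppression plays in the paper's pipeline: features judged uninformative contribute little or nothing to the aggregated residuals.
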
URI: https://hdl.handle.net/10356/150982
ISSN: 1057-7149
DOI: 10.1109/TIP.2018.2890749
Rights: © 2019 IEEE. All rights reserved.
Fulltext Permission: none
Fulltext Availability: No Fulltext
Appears in Collections: EEE Journal Articles
