|Human action recognition using artificial intelligence
|Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
|Nanyang Technological University
|Wang, H. (2022). Human action recognition using artificial intelligence. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/157639
|Video action recognition is a specific video understanding task that aims to generate an action label, consisting of a verb and a noun, for a given video segment. Like many other video understanding tasks, it remains under active research and is, at the same time, widely applied in real-life applications such as autonomous driving and human-robot interaction. Previous researchers have established several different approaches, including hand-crafted features, two-stream networks, and 3D CNNs. The fundamental difference among these approaches lies in the spatial-temporal modelling they use to capture both the spatial details and the temporal relations in video segments, which are key to video tasks. However, because modelling such information is complex, a trade-off must always be made between accuracy and computational cost. Besides the prediction model, the dataset is also crucial to video tasks, as its scale and the variety of its action categories help models pre-trained on it perform better when deployed in real-life applications. In this project, a survey of existing action recognition methods and datasets was conducted in order to understand the problems mentioned above comprehensively and to evaluate and compare the performance of existing state-of-the-art methods. An efficient deep learning model was then proposed to take advantage of 1) the cheap computation of 2D CNNs and 2) the long-range temporal modelling ability of two-stream networks and 3D CNNs. The largest dataset in egocentric vision was selected as the benchmark on which to compare the proposed model against its baseline. Extensive experiments were designed and conducted to analyse the results, which showed that the proposed method achieves a single-digit accuracy improvement over the state of the art.
This report presents the insights gained from the survey of video action recognition models and datasets, the design of an efficient model, the experimental results with comparisons and discussion, and, most importantly, reflections on the design and development of the model and its performance. A short conclusion and a glimpse of future work are given at the end.
|School of Electrical and Electronic Engineering
|Appears in Collections:
|EEE Student Reports (FYP/IA/PA/PI)
Updated on Feb 19, 2024
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.