Title: Visual recognition using deep learning (video captioning using deep learning)
Authors: Thong, Jing Lin
Keywords: Engineering::Electrical and electronic engineering
Issue Date: 2021
Publisher: Nanyang Technological University
Source: Thong, J. L. (2021). Visual recognition using deep learning (video captioning using deep learning). Final Year Project (FYP), Nanyang Technological University, Singapore.
Abstract: Video captioning refers to the process of conveying the information in video clips through automatically generated natural language sentences. The unprecedented success of deep learning approaches in Computer Vision and Natural Language Processing has spurred significant progress in video captioning research. Video captioning currently has extensive applications in video surveillance, video subtitling and human-robot interaction. Most existing video captioning methods adopt a pure encoder-decoder framework, where the encoder extracts video features and the decoder generates captions. However, even though current state-of-the-art models achieve high scores on the evaluation metrics, a significant proportion of the generated captions still do not accurately describe the visual content of the videos. In this project, a comprehensive survey was conducted to identify and compare the performance of existing state-of-the-art models. A deep learning model was then developed that equips the basic encoder-decoder framework with enhanced visual reasoning capacity by incorporating additional sophisticated spatio-temporal reasoning modules. In addition, as the encoder-decoder framework relies only on a forward flow of information, generating sentences from extracted video features, an additional layer was developed to establish the reverse flow: it regenerates video features from the generated sentences, and these are compared against the original video features. Thereafter, reinforcement learning techniques were used to further optimise the model. Extensive experiments on benchmark datasets demonstrate that the overall model outperforms existing state-of-the-art methods and improves the quality of the generated captions. Moreover, a user-friendly web application was designed using the Django framework to deploy the developed deep learning model. The application allows users to upload selected videos and generate captions. Furthermore, a robust text-based search function was developed to allow users to search for their videos by entering key search terms. The report contains the design of the model, the experimental results, considerations in designing the web application, a systematic guide from the user's perspective, and details of integrating the video captioning model into the web application. It concludes with a discussion of the final results and possible future extensions.
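The encoder-decoder pipeline with a reverse reconstruction flow, as described in the abstract, can be sketched in miniature. The snippet below is an illustrative NumPy mock-up, not the author's implementation: a pooling encoder stands in for the video feature extractor, a greedy decoder emits words from a toy vocabulary, and a reconstruction head maps the decoder's hidden states back into feature space so a reconstruction loss can be computed against the original video features. All names, dimensions, and the toy vocabulary are assumptions for illustration; the real model uses learned spatio-temporal reasoning modules in place of these random projections.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<eos>", "a", "man", "rides", "horse"]  # toy vocabulary (assumption)
FEAT_DIM, HID_DIM = 8, 6                         # toy dimensions (assumption)

# Randomly initialised parameters; in the real model these are learned.
W_enc = rng.standard_normal((FEAT_DIM, HID_DIM))    # encoder projection
W_out = rng.standard_normal((HID_DIM, len(VOCAB)))  # decoder word scorer
W_rec = rng.standard_normal((HID_DIM, FEAT_DIM))    # reconstruction head

def encode(frames):
    """Forward flow, step 1: pool per-frame features into one video vector."""
    return frames.mean(axis=0) @ W_enc  # shape (HID_DIM,)

def decode(h, max_len=5):
    """Forward flow, step 2: greedily emit the highest-scoring word each step."""
    words, states = [], []
    for _ in range(max_len):
        scores = h @ W_out
        idx = int(scores.argmax())
        states.append(h.copy())
        if VOCAB[idx] == "<eos>":
            break
        words.append(VOCAB[idx])
        h = np.tanh(h + 0.1 * W_out[:, idx])  # toy state update
    return words, states

def reconstruct(states):
    """Reverse flow: map decoder states back into the video feature space."""
    return np.stack(states).mean(axis=0) @ W_rec  # shape (FEAT_DIM,)

frames = rng.standard_normal((4, FEAT_DIM))  # 4 mock frame feature vectors
h = encode(frames)
caption, states = decode(h)
rec = reconstruct(states)
# Reconstruction loss: regenerated features vs. the original video features.
rec_loss = float(np.mean((rec - frames.mean(axis=0)) ** 2))
```

During training, `rec_loss` would be added to the usual caption likelihood objective, encouraging the generated sentence to retain enough information to recover the video features.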
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections:EEE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: FYP Final Report Thong Jing Lin.pdf (Restricted Access)
Description: Video Captioning Using Deep Learning
Size: 3.29 MB
Format: Adobe PDF

Page view(s)

Updated on Jan 18, 2022



Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.