Please use this identifier to cite or link to this item:
Title: Video summarization of person of interest (POI)
Authors: Lit, Laura Pei Lin
Keywords: Engineering::Computer science and engineering
Issue Date: 2021
Publisher: Nanyang Technological University
Source: Lit, L. P. L. (2021). Video summarization of person of interest (POI). Final Year Project (FYP), Nanyang Technological University, Singapore.
Project: SCSE21-0546
Abstract: With the increase of video content available, there is a greater need for data management of these digital media. Video summarization aims to create a succinct and comprehensive synopsis through the selection of key details from video media [1]. Most video summarization models summarize the entire video without any prior trimming of key details which may lead to excess information being provided. Conventional video summarization models only provide one summarized statement for the entire video, which often results in a very broad description of activities that happened in the video. The first contribution of this project is a new enhanced video summarization model which provides additional information centered around a particular Person of interest (POI). A new pipeline is developed for video summarization of POI using deep-learning based methods, providing further insights on POI face, action and clothing. Face recognition and detection are first used to identify the POI within the video.When the identity of the POI is identified using face recognition, clothes descriptors are applied on the POI to identify what clothing they are wearing. Finally, the video is trimmed to only include parts containing the POI for more precise video summarization, accurately deriving the key activities the POI is involved in. Multiple state-of-the-art face detection, mask classification and face recognition models have been explored and integrated into the new pipeline to achieve this goal. Convolutional Neural Networks (CNN), such as Resnet 50, are used for classification and Multi-Task Cascaded Convolutional Network (MTCNN) are used for face recognition, while object detection model You Only Look Once (YOLO) is used for human extraction. K-means clustering is used for color extraction of POI clothes. The second contribution is the enhancement of the accuracy of the various individual component’s ability to extract and classify the various objects, and thus justify the selections made for the pipeline. The use of Face Detection using DLIB has achieve an accuracy of 88.2%, however, the enhanced facial recognition model, which includes the use of both Multi-Task Cascaded Convolutional Network (MTCNN) and Face Detection using DLIB, achieved an accuracy of 94.9%, a 6% increased in overall accuracy. While the mask classification model trained using ResNet 50 achieved an accuracy of 98.11%. An overall evaluation of the model and its use cases conclude the report, with possible further expansions such as real-time video detection and optimisation of descriptors. Keywords: Convolutional Neural Network, Face Detection, Face Recognition, Object detection, Video Summarization, You Only Look Once
Schools: School of Computer Science and Engineering 
Organisations: Home Team Science and Technology Agency (HTX)
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections:SCSE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File Description SizeFormat 
lauralit fyp.pdf
  Restricted Access
13.59 MBAdobe PDFView/Open

Page view(s)

Updated on May 30, 2023


Updated on May 30, 2023

Google ScholarTM


Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.