Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/183880
Title: | Unsupervised action segmentation and scene understanding in diverse instructional videos |
Authors: | Wong, Alan Kuan Ming |
Keywords: | Computer and Information Science |
Issue Date: | 2025 |
Publisher: | Nanyang Technological University |
Source: | Wong, A. K. M. (2025). Unsupervised action segmentation and scene understanding in diverse instructional videos. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/183880 |
Project: | CCDS24-0219 |
Abstract: | With the exponential growth of online instructional videos, navigating them and extracting relevant information efficiently has become increasingly challenging. This project presents an unsupervised approach to action segmentation and scene understanding in diverse instructional videos. The segmentation leverages both visual and auditory cues to automatically divide videos into meaningful chapters. The proposed pipeline integrates video frame clustering using ResNet50 and k-means, scene boundary detection via PySceneDetect, and speech transcription using Faster Whisper. While the initial clustering methods were limited in how accurately they could segment content, adopting Google Gemini, a multimodal large language model, significantly improved segmentation accuracy by incorporating both visual and audio context. The final implementation is a Flask-based web application that lets users upload videos and receive structured outputs containing chapter titles, timestamps, descriptions, and transcriptions, facilitating faster and more effective knowledge extraction. Experimental results demonstrate the system’s ability to segment diverse instructional videos with high accuracy and efficiency, though challenges such as audio-visual misalignment and ambiguous scene transitions remain. This research contributes to the broader fields of video understanding, automatic summarization, and accessibility, providing a scalable solution for structuring instructional content so that users can take in such videos more easily. |
URI: | https://hdl.handle.net/10356/183880 |
Schools: | College of Computing and Data Science |
Fulltext Permission: | restricted |
Fulltext Availability: | With Fulltext |
Appears in Collections: | CCDS Student Reports (FYP/IA/PA/PI) |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Alan Wong FYP Final Report.pdf Restricted Access | FYP2025 Alan Wong | 1.93 MB | Adobe PDF | View/Open |
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.