Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/183880
Title: Unsupervised action segmentation and scene understanding in diverse instructional videos
Authors: Wong, Alan Kuan Ming
Keywords: Computer and Information Science
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Wong, A. K. M. (2025). Unsupervised action segmentation and scene understanding in diverse instructional videos. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/183880
Project: CCDS24-0219
Abstract: With the exponential growth of online instructional videos, navigating them and extracting relevant information efficiently has become increasingly challenging. This project presents an unsupervised approach to action segmentation and scene understanding in diverse instructional videos. The approach leverages both visual and auditory cues to divide videos automatically into meaningful chapters. The proposed pipeline integrates video frame clustering using ResNet50 and k-means, scene boundary detection via PySceneDetect, and speech transcription using Faster Whisper. Because the initial clustering methods were limited in how accurately they could segment content, Google Gemini, a multimodal large language model, was adopted; by incorporating both visual and audio context, it significantly enhanced segmentation accuracy. The final implementation is a Flask-based web application that lets users upload videos and receive structured outputs containing chapter titles, timestamps, descriptions, and transcriptions, facilitating faster and more effective knowledge extraction. Experimental results demonstrate the system’s ability to segment diverse instructional videos with high accuracy and efficiency, though challenges such as audio-visual misalignment and ambiguous scene transitions remain. This research contributes to the broader fields of video understanding, automatic summarization, and accessibility, providing a scalable solution for structuring instructional content so that users can absorb it more easily.
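Illustrative note (not taken from the report): the abstract describes clustering video frames and dividing the video into chapters. One way such a step could work is to collapse per-frame cluster labels into contiguous timed segments. The sketch below is a minimal illustration under stated assumptions: one cluster label per sampled frame (for example from k-means over ResNet50 features), a fixed sampling interval, and the hypothetical helper names labels_to_segments and format_timestamp. It does not reproduce the project's actual implementation.

```python
from itertools import groupby


def labels_to_segments(labels, sample_interval_s=1.0):
    """Collapse per-frame cluster labels into contiguous segments.

    Assumption: labels[i] is the cluster id of the i-th sampled frame,
    and frames are sampled every sample_interval_s seconds.
    Returns a list of (start_seconds, end_seconds, label) tuples.
    """
    segments = []
    index = 0
    for label, run in groupby(labels):
        run_length = len(list(run))
        start = index * sample_interval_s
        end = (index + run_length) * sample_interval_s
        segments.append((start, end, label))
        index += run_length
    return segments


def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS (fractions truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Hypothetical labels for nine frames sampled every two seconds.
    example_labels = [0, 0, 0, 1, 1, 2, 2, 2, 2]
    for start, end, label in labels_to_segments(example_labels, 2.0):
        print(format_timestamp(start), format_timestamp(end), "cluster", label)
```

A grouping like this treats every change of cluster label as a chapter boundary, so noisy labels produce many short segments. That is consistent with the abstract's remark that clustering alone had limitations in segmenting content accurately.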
URI: https://hdl.handle.net/10356/183880
Schools: College of Computing and Data Science 
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: Alan Wong FYP Final Report.pdf (Restricted Access)
Description: FYP2025 Alan Wong
Size: 1.93 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.