Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/159081
Title: Enabling data-driven video production with storytelling methodologies
Authors: Dong, Yi
Keywords: Engineering::Computer science and engineering::Computer applications::Arts and humanities
Issue Date: 2022
Publisher: Nanyang Technological University
Source: Dong, Y. (2022). Enabling data-driven video production with storytelling methodologies. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/159081
Abstract: Video has become an increasingly dominant form of storytelling. Current video storytelling research mainly focuses on short video clips that are usually a few seconds in length and contain limited and focused content. This setting is relatively well studied due to the existence of large datasets. However, for long videos, we lack datasets annotated by domain experts. Moreover, a video story is an interdisciplinary subject that involves different modalities. Therefore, we need a holistic view to facilitate effective storytelling. I aim to fill this gap from both the data perspective and the model perspective. The key contribution of this thesis is to develop a more holistic view of video storytelling through the use of data-driven approaches combined with cinematography and psychological insights. This thesis considers both a single video story and a repository of video stories. For a video story, I have studied how to model both long-term and short-term interactions among its segments using both visual and structural characteristics. The primary focus is on story-based video summarization and video paragraph captioning. For a repository of videos, I have studied video streaming based on the viewer’s emotional status to create a video therapy for the cognitively impaired elderly. Firstly, traditional user-interest-based video summarization is extended in Chapter 3 to a holistic, story-based, cinematography-aware approach via domain-specific editing idioms. To address the shortage of storytelling datasets with professional editors’ decisions, the Television Commercial (TVC) dataset is proposed; it contains 618 professional TVC summarization pairs with annotations of editing decisions from domain experts. Existing efforts rely on datasets containing only user interests. However, professional editors take a more holistic view, including domain-specific interest, cinematography rules, and common summarization metrics.
Video summarization models are built on the established concept of editing idioms to incorporate rules of thumb for conveying a narrative. Users can efficiently explore different narrative styles with various combinations of editing idioms from a variety of domains. Secondly, segment-level recurrence video paragraph captioning is extended in Chapter 4 to a holistic graph-based approach via an extra-hop mechanism in Transformers. Video paragraph captioning needs to capture interdependent information, since multiple events coexist and even overlap in a video. I advocate that the self-attention mechanism should be enhanced by hopping across related segments to propagate surrounding and contextual information. Specifically, I propose video graph Transformers that can merge the input segment information into an event correlation graph. Then the segment-level recurrence is extended with the graph-attending ability to represent the video story with more holistic information. As a result, this holistic representation helps to generate more accurate and coherent paragraph captions. Thirdly, Chapter 5 investigates viewer emotion-aware video storytelling as a non-pharmacological therapy for the cognitively impaired elderly. I have designed a system that can select video content from a video repository based on the viewer’s emotional status. Specifically, a sequential decision problem is formulated to consider the long-term effects of each selection. Numerical results have verified the effectiveness and robustness of the algorithm. The critical insight of this thesis is that machine learning approaches combined with traditional cinematography and psychological effects can provide a multi-domain, multi-scale understanding of the input video and enable video production with strong storytelling capabilities. The real-world deployment of the proposed approaches is described in Chapter 6. Finally, Chapter 7 discusses recent advances and future directions in this field.
URI: https://hdl.handle.net/10356/159081
Schools: Interdisciplinary Graduate School (IGS) 
Research Centres: Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY) 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: IGS Theses

Files in This Item:
File: DongYi-Final.pdf
Description: Thesis
Size: 40.16 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.