Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/169609
Title: Learning decoupled models for cross-modal generation
Authors: Wang, Hao
Keywords: Engineering::Computer science and engineering
Issue Date: 2023
Publisher: Nanyang Technological University
Source: Wang, H. (2023). Learning decoupled models for cross-modal generation. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/169609
Abstract: Cross-modal generation plays an important role in translating information between different data modalities, such as images, videos, and text. Two representative tasks under the cross-modal generation umbrella are visual-to-text generation and text-to-visual generation. For visual-to-text generation, most existing methods adopt a pretrained object detection model to extract image object features, from which textual descriptions are generated. However, the pretrained model cannot always produce correct results on data from other domains, so the generated captions may fail to faithfully describe all the visual content. For text-to-visual generation, the traditional approach uses a text-conditioned Generative Adversarial Network (GAN) architecture, in which image generation training and cross-modal similarity learning are coupled; this coupling can reduce image generation quality and diversity. This thesis focuses on two main research questions. First, in visual-to-text generation, how can we learn decoupled models for food image and complex video datasets, which contain mixed ingredients and domain-specific object classes that were not included during object detection pretraining? Second, in text-to-visual generation, how can we decouple image generation training from cross-modal similarity learning, so that text-guided image generation and manipulation can be conducted within the same framework with improved generation quality? To tackle these research questions, we propose learning decoupled models for cross-modal generation tasks. Compared with commonly used coupled architectures, decoupling the model components enables each component to be learned effectively, so that the source modality can be translated to the target modality more easily.
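
To make the decoupling idea in the abstract concrete, the following is a minimal, hypothetical sketch (not the thesis implementation; all module names, dimensions, and the loss are illustrative assumptions). It shows the general pattern the abstract describes for text-to-visual generation: the image generator is trained separately and then frozen, while only a text-to-latent mapping is trained with a cross-modal similarity objective.

    # Illustrative sketch of decoupled text-to-image training (hypothetical,
    # not the thesis code). Stage 1: pretrain an unconditional generator on
    # images alone. Stage 2: freeze it and train only a text-to-latent mapper
    # with a cross-modal similarity loss.
    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        """Stand-in for a pretrained unconditional image generator."""
        def __init__(self, latent_dim=128, img_dim=3 * 64 * 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim, 512), nn.ReLU(),
                nn.Linear(512, img_dim), nn.Tanh(),
            )
        def forward(self, z):
            return self.net(z)

    class TextToLatent(nn.Module):
        """Maps a text embedding into the generator's latent space."""
        def __init__(self, text_dim=256, latent_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(text_dim, 256), nn.ReLU(),
                nn.Linear(256, latent_dim),
            )
        def forward(self, t):
            return self.net(t)

    # Stage 1 (done elsewhere): pretrain the generator, then freeze it so the
    # similarity objective below cannot alter generation quality.
    generator = Generator()
    for p in generator.parameters():
        p.requires_grad_(False)

    # Stage 2: learn only the cross-modal mapping.
    mapper = TextToLatent()
    optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-4)

    def similarity_loss(img_a, img_b):
        """Toy cross-modal similarity objective: cosine distance between the
        generated image and the image paired with the input text."""
        return 1 - nn.functional.cosine_similarity(img_a, img_b, dim=-1).mean()

    # One training step on a dummy batch of (text embedding, paired image).
    text_emb = torch.randn(8, 256)           # hypothetical text encoder output
    paired_img = torch.randn(8, 3 * 64 * 64)
    fake_img = generator(mapper(text_emb))
    loss = similarity_loss(fake_img, paired_img)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Because the generator's weights are frozen during similarity learning, the cross-modal objective cannot degrade generation quality or diversity, which is the motivation for decoupling stated in the abstract; it also lets generation and manipulation share one framework, since both reduce to finding a latent code for the frozen generator.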
URI: https://hdl.handle.net/10356/169609
DOI: 10.32657/10356/169609
Schools: School of Computer Science and Engineering 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:SCSE Theses

Files in This Item:
thesis_wanghao.pdf (21.09 MB, Adobe PDF)
