Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/179889
Title: Controllable image and video synthesis
Authors: Jiang, Yuming
Keywords: Computer and Information Science
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Jiang, Y. (2024). Controllable image and video synthesis. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/179889
Project: NTU-NAP 
MOE-2021-T1-001-088 
IAF-ICP 
MOET2EP20221-0012 
Abstract: Generative models have witnessed remarkable progress in recent years, significantly improving the quality of synthesized images and videos. This thesis extends these advancements by focusing on the controllability of generative models. In parallel with improving synthesis quality, it is important to equip generative models with the ability to control the synthesized content, as controllability paves the way for user interaction and customized content creation. This thesis studies controllable image and video synthesis, including dialog-driven facial image manipulation, text-guided human image and video generation, and video generation from texts and images.

This thesis first explores the manipulation of human faces. To make facial editing more controllable, the proposed method, Talk-to-Edit, edits facial images round by round through a dialog between the user and the machine. Dialog is composed of natural language, yet it is more expressive than a single natural-language instruction because the editing requirements can be clarified over multiple rounds. Dialog-based editing requires the model to perform fine-grained edits. To support fine-grained facial editing, a continual “semantic field” is modeled in the latent space of a pretrained StyleGAN. The semantic field accounts for the non-linear nature of the latent space: editing is achieved by traversing the latent space along the semantic field, with the trajectory guided by linguistic controls.

Beyond human faces, this thesis explores the controllable generation of human full-body images, another type of human-related media. Compared to faces, full-body synthesis is considerably more complex because many factors, e.g., facial appearance and human pose, are involved. For more controllable synthesis, the generation of human full-body images is driven by texts that specify the length and appearance of the desired clothing. A novel framework, Text2Human, is proposed for high-fidelity and diverse human full-body image generation. A hierarchical texture-aware codebook is built to store representations of different textures at different scales. Images are generated by sampling from this codebook with a diffusion-based transformer with mixture-of-experts, conditioned on the input texts. The generated image is further refined by a feed-forward index prediction network.

Building on these explorations of text-driven human image generation, this thesis then investigates text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target person. This requires maintaining the appearance of the synthesized human while it performs complex motions. Text2Performer is proposed to generate vivid human videos with articulated motions from texts. Specifically, human representations are decomposed into appearance representations and pose representations. In this way, the appearance can be well maintained by fixing the appearance representations while sampling pose representations. The sampling of pose representations is performed by a novel continuous VQ-diffuser, which directly outputs continuous pose embeddings for better motion modeling.

Finally, beyond human-centric content, the generation of general videos is also of great importance for a visual synthesis system. This thesis further improves the controllability of synthesizing videos that contain general objects.
The content of the synthesized videos is controlled by text prompts and image prompts. The motivation for introducing image prompts to text-to-video models is that text prompts alone cannot accurately depict the desired subject appearance in line with users’ intents, especially for customized content creation. Text prompts are embedded through cross-attention modules, which is the common practice in text-to-video generation methods. To effectively inject the information of image prompts, the proposed method, VideoBooth, injects the image prompts in a coarse-to-fine manner: coarse visual embeddings from the image encoder provide high-level encodings of the image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale, detailed encodings. These two complementary embeddings faithfully capture the desired appearance. Overall, this thesis presents several advancements in image and video synthesis, introducing flexible control mechanisms that enable enhanced user interaction and facilitate customized content creation.
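
As an illustration of the semantic-field editing described in the abstract, the sketch below shows how a non-linear latent traversal could look in a pretrained StyleGAN latent space. It is a minimal, hypothetical example: the names traverse_semantic_field, field_net, predictor and stylegan_generator are assumptions for exposition, not the actual Talk-to-Edit implementation.

# Hypothetical sketch of semantic-field latent traversal (not the released
# Talk-to-Edit code): a field network predicts a local, non-linear edit
# direction at each latent code instead of applying one global linear shift.
import torch

def traverse_semantic_field(w, field_net, attr_id, target_score,
                            predictor, step_size=0.05, max_steps=50):
    """Move latent code `w` along the learned semantic field until an
    attribute predictor reports the requested degree of the attribute."""
    w = w.clone()
    for _ in range(max_steps):
        score = predictor(w)[:, attr_id]           # current attribute degree
        if torch.all(torch.abs(score - target_score) < 0.1):
            break                                   # close enough to the request
        direction = field_net(w, attr_id)           # location-specific direction
        direction = direction / direction.norm(dim=-1, keepdim=True)
        w = w + step_size * direction               # small step along the field
    return w

# Usage (all components assumed to be pretrained):
# w_edited = traverse_semantic_field(w, field_net, attr_id=3, target_score=4.0,
#                                    predictor=attribute_predictor)
# edited_image = stylegan_generator.synthesis(w_edited)

Stepping in small increments along a location-dependent direction, rather than along one fixed vector, keeps the edit on the latent manifold and lets it be applied to a fine-grained, user-specified degree.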
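
The coarse branch of VideoBooth’s image-prompt injection, as summarized above, can likewise be sketched as follows. This is a hypothetical illustration assuming a CLIP-style image encoder and a learned projection into the text-token space; the class CoarseImagePromptInjector and its dimensions are not taken from the released VideoBooth code.

# Hypothetical sketch of coarse image-prompt injection: a pooled image
# feature is projected to a few pseudo text tokens and appended to the text
# tokens that the video model's cross-attention layers attend to.
import torch
import torch.nn as nn

class CoarseImagePromptInjector(nn.Module):
    def __init__(self, image_dim=1024, text_dim=768, num_tokens=4):
        super().__init__()
        # Maps one pooled image feature to `num_tokens` tokens in text space.
        self.proj = nn.Linear(image_dim, text_dim * num_tokens)
        self.num_tokens = num_tokens
        self.text_dim = text_dim

    def forward(self, text_tokens, image_feature):
        # text_tokens:   [B, L, text_dim]  from the text encoder
        # image_feature: [B, image_dim]    from a frozen image encoder (e.g. CLIP)
        image_tokens = self.proj(image_feature).view(-1, self.num_tokens, self.text_dim)
        # Concatenate so cross-attention sees both text and image-prompt tokens.
        return torch.cat([image_tokens, text_tokens], dim=1)

# Usage:
# injector = CoarseImagePromptInjector()
# context = injector(text_tokens, clip_image_feature)
# The video latents then cross-attend to `context`, while a fine branch
# (attention injection) would add multi-scale detail on top.

The coarse tokens give cross-attention a high-level notion of the subject, which finer, multi-scale embeddings then refine.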
URI: https://hdl.handle.net/10356/179889
DOI: 10.32657/10356/179889
Schools: College of Computing and Data Science 
Research Centres: S-Lab For Advanced Intelligence
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:CCDS Theses

Files in This Item:
File: Jiang Yuming_Thesis_final_v2.pdf
Size: 27.26 MB
Format: Adobe PDF
