Title: Learning to control visual data translation
Authors: Koksal, Ali
Keywords: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Issue Date: 2023
Publisher: Nanyang Technological University
Source: Koksal, A. (2023). Learning to control visual data translation. Doctoral thesis, Nanyang Technological University, Singapore.
Abstract: With advances in deep learning models such as Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs), and in image-to-image translation in particular, the generation of high-dimensional data such as images and videos has reached photo-realistic quality. The ability to control the translation is an important aspect of modifying and synthesizing novel content. In this thesis, we address controllable high-dimensional data translation in three variations: (i) unpaired image-to-image translation, (ii) motion-controllable video generation, and (iii) motion-aware mask-to-frame translation.
In unpaired image-to-image translation, images in a source domain are translated to a target domain whose images often have different characteristics, such as colors and styles, and no paired images are available in the training set. Learning to map between domains is challenging without the supervision that paired images provide. In addition, state-of-the-art GANs for unpaired image-to-image translation are often constrained by large model sizes. To tackle these challenges, we introduce a reconfigurable generator, inspired by the observation that the mappings between two domains are often approximately invertible, and a multi-domain discriminator that allows joint discrimination of original and translated samples from different domains. We propose two compact models that employ the reconfigurable generator and the multi-domain discriminator. The first, Reconfigurable Generative Adversarial Network (RF-GAN), consistently achieves high-fidelity translation with a model up to 88% more compact than state-of-the-art GANs. The second, Transformer-based Reconfigurable Generative Adversarial Network (TRF-GAN), replaces certain convolutions of RF-GAN's generator with transformers and further improves translation performance with a more compact model that has approximately 25% fewer parameters than RF-GAN.
Motion-controllable video generation is a variant of high-dimensional data translation in which an initial frame is translated to subsequent frames by controlling the motion of the object of interest. We address this variant as text-based control over the action performed in the generated video. Building a semantic association between instructions and motion is challenging because text descriptions are often ambiguous for video generation. To overcome this challenge, we introduce a novel framework, Controllable Video Generation with text-based Instructions (CVGI), that allows text-based control over the action performed in a video. By incorporating a motion estimation layer, the framework divides the task into two subtasks: (i) control signal estimation and (ii) action generation. In control signal estimation, an encoder models actions as a set of simple motions by estimating low-level control signals from text-based instructions and given initial frames. In action generation, we employ a GAN to generate realistic videos conditioned on the estimated low-level signal. Evaluations on several datasets show the effectiveness of CVGI in generating realistic videos and in controlling actions.
Although CVGI can generate realistic videos that correspond well with instructions and can control the motion according to instructions, it is limited in generating egocentric videos. Egocentric videos are typically shot with a head-mounted camera, which introduces considerable camera movement and produces dynamic scenes. We therefore introduce motion-aware mask-to-frame translation, in which the mask of the object of interest in the next frame is translated into a next frame that is consistent with the initial frame. To address the limitation of CVGI in egocentric video generation, we extend CVGI with motion-aware mask-to-frame translation, where the next frame is translated from its mask by using the initial frame and the mask of the object of interest in the initial frame as additional supervision. The proposed GAN takes this additional supervision as input to the generator and incorporates three discriminators that are trained to distinguish real from generated whole frames, objects of interest, and backgrounds. To generate motion-controllable egocentric videos, masks of the object of interest that correspond well with the text-based instructions are generated first; these masks are then translated to frames by the motion-aware mask-to-frame translation GAN. Evaluations on a publicly available egocentric dataset show that the proposed GAN is capable of hallucinating pixels at the location that the object of interest occupied in the initial frame, as indicated by the initial mask, and of creating the object of interest at the new location indicated by the next mask.
To sum up, we design innovative models for three variations of controllable high-dimensional data translation, in which a mapping function is trained to translate input high-dimensional data to novel high-dimensional data under the control of conditions. Within the scope of this thesis, the high-dimensional inputs are images, frames, and masks of the object of interest. We evaluate our models on benchmark datasets to show the effectiveness of the proposed frameworks. Simulating different motions can be used to train robotic systems such as robotic arms without requiring data collection for all possible motions. By predicting possible variants of the future under different motions, controllable video generation can also provide useful insight for intelligent decision-making systems such as driver assistance systems and autonomous drones.
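The abstract mentions three discriminators that judge the whole frame, the object of interest, and the background. It does not say how the object and background views are formed, so the following is only a minimal sketch under the assumption that they are obtained by element-wise masking of the frame with a binary object mask. The function name `split_views` and the toy 2x2 frame are hypothetical and are not taken from the thesis.

```python
# Hypothetical helper, not from the thesis: split one frame into the three
# views that separate discriminators could score (whole, object, background).
# Assumption: the views are formed by element-wise masking with a 0/1 mask.
def split_views(frame, mask):
    """frame: 2D list of pixel values; mask: 2D list of 0/1 (1 = object)."""
    # Keep only pixels where the mask is 1 (object of interest).
    obj = [[p * m for p, m in zip(f_row, m_row)]
           for f_row, m_row in zip(frame, mask)]
    # Keep only pixels where the mask is 0 (background).
    bg = [[p * (1 - m) for p, m in zip(f_row, m_row)]
          for f_row, m_row in zip(frame, mask)]
    return frame, obj, bg


# Toy grayscale frame and object mask, for illustration only.
frame = [[10, 20], [30, 40]]
mask = [[1, 0], [0, 1]]
whole, obj, bg = split_views(frame, mask)
```

Under this assumption, the object and background views are complementary: adding them pixel by pixel recovers the whole frame, so each discriminator sees a distinct part of the same image.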
DOI: 10.32657/10356/165566
Schools: School of Computer Science and Engineering 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Theses

Files in This Item:
File: KoksalAli_PhDThesis_FinalVersion.pdf
Size: 21.87 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.