Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/179571
Title: Exploring priors for visual content restoration and enhancement
Authors: Zhou, Shangchen
Keywords: Computer and Information Science
Engineering
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Zhou, S. (2024). Exploring priors for visual content restoration and enhancement. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/179571
Project: HTX000ECI23000146
Abstract: As images and videos have emerged as primary social media, visual content restoration and enhancement have gained increasing significance for downstream applications and systems. These include old photo and film restoration, image editing, photography enhancement, and improving the quality of visual systems such as smart security, autonomous driving, and the metaverse, among others. While previous research has introduced numerous effective techniques, the majority focus on designing and refining image-to-image mappings, typically trained and evaluated on small datasets, overlooking auxiliary information readily available in practice. This tendency often leads models to overfit the training data, resulting in limited generalization and effectiveness in real-world scenarios, where degradation can be unseen, heavy, and complex, and where a domain gap exists between training and inference. To address this issue, this thesis explores effective visual priors for restoration and enhancement from multiple perspectives: the internal prior within the inputs, the degradation prior in synthetic training data, and the generative prior in pretrained models. These priors show significant potential for developing solutions that are effective, robust, and practical.

Firstly, beyond external training data, inputs often contain valuable internal information crucial for restoration. This thesis explores the internal priors inherent in these inputs. Specifically, we investigate the cross-scale internal prior for image super-resolution (SR), which aims to recover high-resolution (HR) details from low-resolution (LR) observations. Leveraging the cross-scale patch recurrence property, whereby patches within a single natural image frequently reappear across different scales, we propose the Internal Graph Neural Network (IGNN) to model these internal correlations between cross-scale similar patches as a graph (a minimal sketch of this cross-scale matching step is given after this paragraph). Unlike traditional SR networks that solely learn LR-to-HR mappings from external data, IGNN utilizes the top-k most probable HR counterparts derived from the LR image itself to restore more detailed textures. The state-of-the-art performance achieved by IGNN demonstrates the effectiveness of the cross-scale internal prior for the image SR task. The internal patch recurrence property also exists across adjacent frames in a video. We therefore explore the cross-frame internal prior for video inpainting and propose ProPainter, which integrates recurrent propagation and cross-frame Transformer modules to gather internal corresponding information in a frame-by-frame and a direct manner, respectively. In particular, our approach introduces a reliable dual-domain propagation that merges the benefits of image and feature warping, exploiting long-range global correspondences. We also propose an efficient sparse Transformer that improves efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms previous methods by a large margin while maintaining appealing efficiency.
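The following is a minimal, illustrative sketch of the cross-scale patch matching idea underlying the internal prior described above, written with standard PyTorch operations; the function name, patch size, and similarity measure are assumptions for illustration, not the authors' implementation.

```python
# Cross-scale top-k patch matching: for each patch in the LR image, find its
# k most similar patches in a downscaled copy of the same image. The matched
# locations, viewed back at the original scale, provide "HR" exemplars drawn
# from the image itself.
import torch
import torch.nn.functional as F

def cross_scale_topk(lr, scale=2, patch=3, k=5):
    """lr: (B, C, H, W) low-resolution image.
    Returns top-k similarity scores and indices of matching patches in the
    downscaled image, for every patch location in `lr`."""
    lr_down = F.interpolate(lr, scale_factor=1.0 / scale, mode="bicubic",
                            align_corners=False)

    # Unfold both images into overlapping patch descriptors: (B, C*p*p, N).
    q = F.unfold(lr, kernel_size=patch, padding=patch // 2)           # queries
    kdesc = F.unfold(lr_down, kernel_size=patch, padding=patch // 2)  # keys

    # Cosine similarity between every query patch and every key patch.
    q = F.normalize(q, dim=1)
    kdesc = F.normalize(kdesc, dim=1)
    sim = torch.einsum("bcn,bcm->bnm", q, kdesc)    # (B, N_q, N_k)

    scores, idx = sim.topk(k, dim=-1)               # k most similar patches
    return scores, idx
```

Each returned index identifies a patch location in the downscaled image; the patch at the corresponding location in the original LR image is a `scale`-times-larger exemplar of the query, which is the kind of cross-scale neighbor IGNN aggregates over a graph.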
Secondly, to train a proficient network for a particular restoration task, it is crucial to acquire paired training data that accurately represents the degradation process. In this thesis, we investigate defining an exact degradation prior for a novel joint task of low-light image enhancement and deblurring. Capturing paired training data for this task is challenging or even impossible. We therefore formulate the degradation prior as a stochastic data synthesis pipeline for training. Specifically, this pipeline integrates a controllable low-light simulation and a new blur model tailored for dark scenes, which together model realistic and diverse low-light blurring degradations and allow the network to intrinsically learn the inverse process for this joint task. By incorporating the degradation prior into the training data and careful designs into the network, our proposed LEDNet outperforms previous solutions when applied to real-world data.

Thirdly, real-world low-quality images often suffer from various degradations, such as compression, blur, and noise. Restoring such images is highly ill-posed and challenging, as significant information may have been corrupted or lost, leading to the suboptimal restoration quality of earlier approaches. In this thesis, we explore the generative prior encapsulated in pretrained synthesis models for restoration tasks. We first introduce CodeFormer, which demonstrates that a pretrained Vector Quantized Generative Adversarial Network (VQGAN) with a discrete codebook can serve as a generative prior in a small proxy space, largely reducing the uncertainty and ambiguity of the face restoration mapping while providing rich visual atoms for generating high-quality face images. We then cast face restoration as a code prediction task and employ a Transformer network to model the code composition of faces (sketched after the abstract). Furthermore, we introduce a controllable feature transformation module that allows a flexible balance between fidelity and quality. Thanks to the expressive codebook prior and global modeling, CodeFormer surpasses existing methods in both quality and fidelity, exhibiting superior robustness to degradation. We further exploit the generative prior encapsulated in a pretrained image diffusion model for real-world video super-resolution (VSR). Leveraging this strong prior, we introduce Upscale-A-Video, a text-guided latent diffusion framework that transfers knowledge learned from image upscaling to video super-resolution, enabling more efficient training. This framework ensures temporal coherence in two ways: locally, it integrates temporal layers into the U-Net and VAE-Decoder to maintain consistency within short sequences; globally, a flow-guided recurrent propagation module enhances overall video stability by propagating latents across the entire sequence. In addition, our model offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Our model significantly outperforms existing methods on both synthetic and real-world benchmarks, as well as on AI-generated videos, showcasing impressive visual realism and texture detail.
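Below is a minimal sketch of the codebook-based code prediction idea described in the abstract. It assumes a frozen pretrained VQGAN supplies the codebook, encoder, and decoder (all placeholders here); the module name, token count, codebook size, and 4-layer Transformer are illustrative assumptions, not the thesis's exact architecture.

```python
# A Transformer predicts discrete code indices in a small proxy space; the
# looked-up code vectors are then handed to a (frozen) VQGAN decoder to
# synthesize a clean face.
import torch
import torch.nn as nn

class CodePredictor(nn.Module):
    def __init__(self, num_codes=1024, dim=256, num_tokens=16 * 16):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)   # frozen VQGAN codebook
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.to_logits = nn.Linear(dim, num_codes)

    def forward(self, feat_tokens):
        """feat_tokens: (B, num_tokens, dim) features of a degraded face from
        the frozen encoder. Returns quantized code vectors for the decoder."""
        h = self.transformer(feat_tokens + self.pos)
        logits = self.to_logits(h)           # (B, num_tokens, num_codes)
        idx = logits.argmax(dim=-1)          # predicted code indices
        return self.codebook(idx)            # (B, num_tokens, dim)
```

Because the prediction is made over a small, discrete set of high-quality codes rather than a continuous feature space, heavy degradation in the input mainly affects which codes are chosen, not the visual quality of what the decoder can produce.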
URI: https://hdl.handle.net/10356/179571
DOI: 10.32657/10356/179571
Schools: College of Computing and Data Science 
Research Centres: S-Lab
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Theses

Files in This Item:
File: Thesis_NTU_Shangchen_Zhou.pdf
Size: 96.17 MB
Format: Adobe PDF

Page view(s): 151 (updated on Oct 11, 2024)
Download(s) 50: 48 (updated on Oct 11, 2024)

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.