Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/172665
Title: UniD3: unified discrete diffusion for simultaneous vision-language generation
Authors: Hu, Minghui
Zheng, Chuanxia
Cham, Tat-Jen
Suganthan, Ponnuthurai Nagaratnam
Yang, Zuopeng
Zheng, Heliang
Wang, Chaoyue
Tao, Dacheng
Keywords: Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Issue Date: 2023
Source: Hu, M., Zheng, C., Cham, T., Suganthan, P. N., Yang, Z., Zheng, H., Wang, C. & Tao, D. (2023). UniD3: unified discrete diffusion for simultaneous vision-language generation. 2023 International Conference on Learning Representations (ICLR), 1-23.
Conference: 2023 International Conference on Learning Representations (ICLR)
Abstract: The recently developed discrete diffusion model performs extraordinarily well in generation tasks, especially in the text-to-image task, showing great potential for modeling multimodal signals. In this paper, we leverage these properties and present a unified multimodal generation model, which can perform text-based, image-based, and even vision-language simultaneous generation using a single model. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified Markov transition matrix and a unified objective. Moreover, we design a multimodal mutual attention module to highlight the inter-modal linkages, which is vital for multimodal generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
URI: https://hdl.handle.net/10356/172665
URL: https://openreview.net/forum?id=8JqINxA-2a
Schools: School of Computer Science and Engineering 
Rights: © 2023 The Author(s). All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at https://openreview.net/forum?id=8JqINxA-2a.
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Conference Papers

Files in This Item:
File: 2023-Hu_etal-ICLR-UniD3_Unified_discrete_diffusion_vision_language.pdf
Description: Conference full paper
Size: 16.7 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.