Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/178464
Title: Q-instruct: improving low-level visual abilities for multi-modality foundation models
Authors: Wu, Haoning; Zhang, Zicheng; Zhang, Erli; Chen, Chaofeng; Liao, Liang; Wang, Annan; Xu, Kaixin; Li, Chunyi; Hou, Jingwen; Zhai, Guangtao; Xue, Geng; Sun, Wenxiu; Yan, Qiong; Lin, Weisi
Keywords: Computer and Information Science
Issue Date: 2024
Source: Wu, H., Zhang, Z., Zhang, E., Chen, C., Liao, L., Wang, A., Xu, K., Li, C., Hou, J., Zhai, G., Xue, G., Sun, W., Yan, Q. & Lin, W. (2024). Q-instruct: improving low-level visual abilities for multi-modality foundation models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 25490-25500.
Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Abstract: Multi-modality foundation models, as represented by GPT-4V, have brought a new paradigm to low-level visual perception and understanding tasks: a single model that can respond to a broad range of natural human instructions. While existing foundation models have shown exciting potential on low-level visual tasks, their related abilities are still preliminary and need to be improved. To enhance these models, we conduct a large-scale subjective experiment to collect real human feedback on low-level vision. Each feedback item follows a pathway that starts with a detailed description of the low-level visual appearance (*e.g., clarity, color, brightness* of an image) and ends with an overall conclusion, with an average length of 45 words. The constructed **Q-Pathway** dataset includes 58K detailed human feedback items on 18,973 images with diverse low-level appearance. Moreover, to enable foundation models to robustly respond to diverse types of questions, we design a GPT-participated conversion that processes these feedback items into 200K instruction-response pairs in diverse formats. Experimental results indicate that **Q-Instruct** consistently elevates low-level perception and understanding abilities across several foundation models. We anticipate that our datasets can pave the way for a future in which general intelligence can perceive and understand low-level visual appearance and evaluate visual quality like a human. Our dataset, model zoo, and demo are published at: https://q-future.github.io/Q-Instruct.
URI: https://hdl.handle.net/10356/178464
URL: http://arxiv.org/abs/2311.06783v1 ; https://openaccess.thecvf.com/content/CVPR2024/papers/Wu_Q-Instruct_Improving_Low-level_Visual_Abilities_for_Multi-modality_Foundation_Models_CVPR_2024_paper.pdf
DOI (Related Dataset): 10.21979/N9/GPLPNI
Schools: College of Computing and Data Science
Research Centres: S-Lab
Rights: © 2024 IEEE. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder.
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Conference Papers
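The abstract above describes converting each detailed Q-Pathway feedback into instruction-response pairs for tuning foundation models. The minimal Python sketch below illustrates that idea in outline only; the `PathwayFeedback` class, its field names, and the prompt text are hypothetical assumptions for illustration and do not reflect the released Q-Instruct data format.

```python
# Minimal sketch (not the released schema): one way a Q-Pathway feedback
# could be paired with an image and re-cast as an instruction-response
# sample for visual instruction tuning. All field names are illustrative
# assumptions, not the actual Q-Instruct format.
from dataclasses import dataclass


@dataclass
class PathwayFeedback:
    image_path: str   # image the human rater described
    description: str  # detailed low-level description (clarity, color, brightness, ...)
    conclusion: str   # overall quality conclusion


def to_instruction_pair(fb: PathwayFeedback) -> dict:
    """Convert one feedback item into a single-turn instruction-response sample."""
    return {
        "image": fb.image_path,
        "instruction": "Describe the low-level visual quality of this image.",
        "response": f"{fb.description} Overall, {fb.conclusion}",
    }


if __name__ == "__main__":
    fb = PathwayFeedback(
        image_path="example.jpg",
        description="The image is slightly out of focus, with muted colors "
                    "and mild underexposure in the shadows.",
        conclusion="its quality is below average.",
    )
    print(to_instruction_pair(fb))
```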
Files in This Item:
File | Description | Size | Format
---|---|---|---
Wu_Q-Instruct_Improving_Low-level_Visual_Abilities_for_Multi-modality_Foundation_Models_CVPR_2024_paper.pdf | | 1.73 MB | Adobe PDF
Page view(s): 72 (updated on Sep 11, 2024)
Download(s): 16 (updated on Sep 11, 2024)