Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/162628
Title: Context-aware visual policy network for fine-grained image captioning
Authors: Zha, Zheng-Jun
Liu, Daqing
Zhang, Hanwang
Zhang, Yongdong
Wu, Feng
Keywords: Engineering::Computer science and engineering
Issue Date: 2019
Source: Zha, Z., Liu, D., Zhang, H., Zhang, Y. & Wu, F. (2019). Context-aware visual policy network for fine-grained image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2), 710-722. https://dx.doi.org/10.1109/TPAMI.2019.2909864
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract: With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be translated to the task of sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy and neglect the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context, and decides whether the context is used for the current word/sentence generation given the current visual attention. Compared against traditional visual attention mechanisms that fix on a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model, comprising CAVP and its subsequent language policy network, can be efficiently optimized end-to-end by an actor-critic policy gradient method. We demonstrate the effectiveness of CAVP with state-of-the-art performance on the MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.
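The abstract describes the core mechanism: at each decoding step, the visual policy attends over the history of previous visual attentions (the "context") together with the current attended feature, and gates how much of that context feeds the current word prediction. The sketch below illustrates this idea only; it is not the authors' released code, and all module names, dimensions, and the specific gating scheme are illustrative assumptions.

```python
# Minimal sketch of a context-aware visual policy step, assuming a simple
# dot-product attention over past attended features and a sigmoid gate that
# decides whether the context is used for the current word generation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareVisualPolicy(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.query = nn.Linear(feat_dim, feat_dim)  # decoder state -> attention query
        self.key = nn.Linear(feat_dim, feat_dim)    # past visual attentions -> keys
        self.gate = nn.Linear(2 * feat_dim, 1)      # decides how much context to use

    def forward(self, h_t, v_t, context):
        """h_t: (B, D) decoder hidden state at step t
        v_t: (B, D) current attended visual feature
        context: (B, T, D) stack of the T previous visual attentions
        Returns a composed visual representation for word prediction."""
        # Score each past visual attention against the current decoder state.
        q = self.query(h_t).unsqueeze(1)                 # (B, 1, D)
        k = self.key(context)                            # (B, T, D)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5     # (B, T)
        weights = F.softmax(scores, dim=-1)
        c_t = (weights.unsqueeze(-1) * context).sum(1)   # (B, D) context summary

        # Gate whether the context is useful for this step (e.g., when the
        # next word expresses a relationship like "riding").
        g = torch.sigmoid(self.gate(torch.cat([v_t, c_t], dim=-1)))  # (B, 1)
        return v_t + g * c_t                             # composed visual feature


if __name__ == "__main__":
    B, T, D = 2, 5, 512
    policy = ContextAwareVisualPolicy(D)
    out = policy(torch.randn(B, D), torch.randn(B, D), torch.randn(B, T, D))
    print(out.shape)  # torch.Size([2, 512])
```

In the paper's framing, this composed feature is handed to the language policy network, and both policies are trained jointly with an actor-critic policy gradient on a sentence-level reward; those training details are omitted from this sketch.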
URI: https://hdl.handle.net/10356/162628
ISSN: 0162-8828
DOI: 10.1109/TPAMI.2019.2909864
Rights: © 2019 IEEE. All rights reserved.
Fulltext Permission: none
Fulltext Availability: No Fulltext
Appears in Collections:SCSE Journal Articles

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.