Title: Bridging images and natural language with deep learning
Authors: Gu, Jiuxiang
Keywords: Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Issue Date: 2019
Source: Gu, J. (2019). Bridging images and natural language with deep learning. Doctoral thesis, Nanyang Technological University, Singapore.
Abstract:
We, as humans, can easily use our vision and language capabilities to accomplish a wide variety of tasks that combine the image and text modalities. Machines find this far harder, because it requires a model to understand both the image and the language, and especially how the two relate to each other. In recent years, considerable progress has been made in applying deep learning to computer vision and natural language processing, but connecting images with natural language remains challenging because the two modalities differ in structure and characteristics. In this thesis, I seek to bridge images and natural language with deep learning. Five methods are proposed to reduce the gap between the image and text modalities: a convolutional neural network-based language model for image captioning, coarse-to-fine learning for image captioning, visual-textual cross-modal retrieval with generative models, unpaired image captioning by language pivoting, and unpaired image captioning via scene graph alignments.

Overall, the major contributions of this thesis are as follows:

• A convolutional neural network-based language model suitable for statistical language modeling tasks. Unlike previous recurrent neural network-based language models, which predict the next word from the single preceding word and a hidden state, this model is fed all of the previous words. Its ability to model the hierarchical structure and long-range dependencies of words is critical for image captioning.

• A coarse-to-fine multi-stage prediction framework for image captioning. The framework is composed of multiple decoders, each of which operates on the output of the previous stage, producing increasingly refined image descriptions. In particular, I optimize the model with a reinforcement learning approach that uses the output of each intermediate decoder's test-time inference algorithm, together with the output of its preceding decoder, to normalize the rewards.

• A visual-textual cross-modal retrieval method with generative learning. Unlike existing cross-modal retrieval approaches that embed image-text pairs as single feature vectors in a common representational space, I incorporate two generative processes (image-to-text and text-to-image) into the cross-modal feature embedding, through which the model learns not only global abstract features but also local grounded features.

• An unpaired image captioning method based on language pivoting. A pivot language serves as an intermediary between an input image and a caption in the target language: the method captures the characteristics of an image captioner in the pivot language and aligns them to the target language using a pivot-target parallel sentence corpus. An autoencoder in the target language guides the target decoder to produce caption-like sentences.

• An unpaired image captioning method based on scene graph alignments. The framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. I first train the scene graph encoder and the sentence decoder on the text modality; to align the scene graphs between images and sentences, I propose an unsupervised feature alignment method that maps scene graph features from the image modality to the sentence modality without any paired training data.

Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between images and natural language.
Experimental results on public vision and language datasets show that all of these methods achieve significant performance improvements on vision and language tasks such as image captioning and cross-modal retrieval.
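The first contribution rests on the idea that a convolutional language model can condition each word prediction on all previous words via causal (left-padded) convolutions. The thesis's actual architecture is not reproduced here; the following is a minimal sketch of the causal-convolution property alone, with hypothetical names and toy shapes:

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution: output at position t depends only on x[:t+1].

    x: (T, d_in) sequence of word embeddings
    w: (k, d_in, d_out) convolution kernel of width k
    """
    k, d_in, d_out = w.shape
    T = x.shape[0]
    # Left-pad with k-1 zero rows so position t sees x[t-k+1 .. t] only.
    xp = np.vstack([np.zeros((k - 1, d_in)), x])
    y = np.zeros((T, d_out))
    for t in range(T):
        window = xp[t:t + k]                   # (k, d_in) slice ending at t
        y[t] = np.einsum('ki,kio->o', window, w)
    return y

# Toy check: perturbing a later word must not change earlier outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
w = rng.normal(size=(3, 4, 5))
y1 = causal_conv1d(x, w)
x2 = x.copy()
x2[5] += 10.0                                  # perturb the last word only
y2 = causal_conv1d(x2, w)
assert np.allclose(y1[:5], y2[:5])             # positions 0..4 unchanged
```

Stacking such layers grows the receptive field, which is how a convolutional language model can see the full word history rather than a single previous word and hidden state.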
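The coarse-to-fine contribution normalizes each stage's reinforcement-learning reward using that stage's own test-time (greedy) decode and the preceding decoder's output as baselines. How the two baselines are combined is not specified in this abstract; the sketch below assumes a simple average, and all names are hypothetical:

```python
def stage_advantage(r_sampled, r_greedy, r_prev_stage):
    """Advantage for one decoding stage in a coarse-to-fine captioner.

    r_sampled:    reward (e.g. a CIDEr-style score) of the stage's sampled caption
    r_greedy:     reward of the same stage's greedy test-time decode
    r_prev_stage: reward of the preceding decoder's output
    The mean of the two baselines is an assumption, not the thesis's formula.
    """
    baseline = 0.5 * (r_greedy + r_prev_stage)
    return r_sampled - baseline

# e.g. stage 2: sampled caption scores 0.62, its greedy decode 0.58,
# and stage 1's output 0.50, giving advantage 0.62 - 0.54 = 0.08
adv = stage_advantage(0.62, 0.58, 0.50)
```

Subtracting a baseline built from the model's own deterministic outputs (as in self-critical sequence training) reduces gradient variance while rewarding only captions that beat what the model already produces.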
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: IGS Theses