Tools for visual scene recognition
Date of Issue: 2014
School of Computer Engineering
Centre for Computational Intelligence
Scene recognition is an important step towards a full understanding of an image. This thesis presents novel ideas related to semantic-spatial content capture and local-global feature fusion techniques and applies them to scene recognition. It shows how the proper use of these approaches, without trying to recognize objects in the scene images, can improve recognition accuracy for scene classification. First, we propose a method to build a semantic visual vocabulary. Features are extracted from image patches, and the initial vocabulary is constructed by performing k-means clustering on the extracted features and choosing the cluster centers as the visual words. The feature vectors are quantized against the initial vocabulary to form a word-image matrix that describes the occurrence of words in the images. The codebooks are then embedded into the concept space by latent semantic models. We demonstrate this embedding using Latent Semantic Analysis (LSA) as well as Probabilistic Latent Semantic Analysis (pLSA). In the proposed space, the distances between words represent semantic distances, which are used to construct a discriminative and semantically meaningful vocabulary. The main contributions of the first chapter are as follows: (1) using the semantic word space to co-cluster similar words into a semantic visual vocabulary, which improves the results compared with methods that use the document space directly after pLSA embedding; (2) investigating changes in the number of latent variables; (3) using LSA embedding, whereas all other vision systems to date use only pLSA. This method has shown promising results on the 15-Scene categories dataset when the proposed model extracts one type of visual feature.
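The first two steps of the vocabulary construction described above (k-means on patch features, then quantization into a word-image matrix) can be illustrated with a minimal sketch. It is not the thesis implementation: the toy 2-D features, the function names, and the choice to seed k-means with the first k points are all assumptions made for illustration, and the later LSA/pLSA embedding is not shown.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Seeding with the first k points is an assumption."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster becomes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers


def quantize(feature, centers):
    """Index of the visual word (cluster center) closest to a feature."""
    return min(range(len(centers)), key=lambda j: dist(feature, centers[j]))


def word_image_matrix(images, centers):
    """Rows are visual words, columns are images, entries are occurrence counts."""
    matrix = [[0] * len(images) for _ in centers]
    for col, features in enumerate(images):
        for f in features:
            matrix[quantize(f, centers)][col] += 1
    return matrix


# Toy 2-D "patch features" for two images (made-up numbers).
images = [
    [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0)],
    [(5.1, 4.9), (4.9, 5.0), (0.0, 0.0)],
]
all_features = [f for features in images for f in features]
centers = kmeans(all_features, k=2)
counts = word_image_matrix(images, centers)
print(counts)
```

In this sketch each column of `counts` is a bag-of-words description of one image; a latent semantic model would then be fitted to this matrix.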
Second, since fusing local and global features is beneficial for the performance of scene categorization systems, we propose a novel Local-Global Feature Fusion (LGFF) method that can fuse latent semantic patches adaptively. The image local feature space is embedded into the latent semantic space by pLSA modeling; afterwards, this semantic space and the global contextual feature space are mapped into a kernel space. To perform this embedding, the global features and latent variables are weighted relative to each other for each scene category. The main contributions of the second chapter are as follows: (1) weighting latent semantic topics based on their discriminative power; (2) defining a novel exemplar-based distance learning method; (3) defining a category-dependent map function. The experimental evaluations indicate radical improvements on the 15-Scene and 67-Indoor Scenes datasets.
Third, every scene image carries a great deal of spatial information. Capturing the spatial positions of visual features plays an essential role in scene categorization; however, the methods proposed in the previous chapters disregard this position information. Inspired by methods that construct pyramid levels over image primitives [8, 9], pyramid matching kernels are employed to measure the dissimilarity scores of image pyramids. We improve the semantic vocabulary framework by considering the position of semantic local patches and the properties of their surrounding neighborhood, using either the global or the region-based spatial pyramid method. In the global method, the image is first projected into the concept space and then divided into finer sub-scenes, and the pyramid match kernels are employed over the proposed space for co-clustering the semantic visual words. The region-based method first divides the image into sub-scenes and then projects each sub-scene into the concept space.
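As a rough illustration of pyramid matching over a spatial grid, the sketch below compares two images given as lists of (x, y, word) triples. The level weights follow the commonly used spatial pyramid matching scheme, which is an assumption here; the abstract does not state the exact kernel, and the function names and toy points are hypothetical. Note that this kernel returns a similarity-style score (larger means more alike).

```python
def spatial_histogram(points, num_words, level):
    """Per-cell visual word counts on a 2^level x 2^level grid.

    points are (x, y, word) triples with x and y normalized to [0, 1].
    """
    cells = 2 ** level
    hist = [0] * (cells * cells * num_words)
    for x, y, word in points:
        cx = min(int(x * cells), cells - 1)
        cy = min(int(y * cells), cells - 1)
        hist[(cy * cells + cx) * num_words + word] += 1
    return hist


def intersection(h1, h2):
    """Histogram intersection: number of matched entries."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def pyramid_match(points_a, points_b, num_words, max_level=2):
    """Weighted sum of intersections; finer levels receive larger weights."""
    matches = [
        intersection(
            spatial_histogram(points_a, num_words, level),
            spatial_histogram(points_b, num_words, level),
        )
        for level in range(max_level + 1)
    ]
    score = matches[0] / 2 ** max_level
    for level in range(1, max_level + 1):
        score += matches[level] / 2 ** (max_level - level + 1)
    return score


# Two toy images with the same words; one word sits in a different quadrant.
image_a = [(0.1, 0.1, 0), (0.9, 0.9, 1)]
image_b = [(0.1, 0.1, 0), (0.1, 0.9, 1)]
print(pyramid_match(image_a, image_b, num_words=2, max_level=1))
```

With these weights an image matched against itself scores exactly its number of points, so the kernel rewards words that agree in both identity and position.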
In the Local-Global Feature Fusion framework, the global features gist and CENsus TRansform hISTogram (CENTRIST) already capture the global spatial layout of the scene. To further improve the results, the image is divided into sub-regions at different levels of resolution, and the pyramid matching kernels are then applied over these sub-regions. Each sub-region is represented either by CENTRIST or by a bag of Scale-Invariant Feature Transform (SIFT) visual features. The experimental results outperform most of the best published results for both the 15-Scene and 67-Indoor Scenes datasets. This thesis concludes with a discussion of future directions for extending the proposed work on scene recognition.
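To make the CENTRIST descriptor mentioned above more concrete, here is a minimal sketch of a census transform histogram on a grayscale image stored as a list of rows. It assumes the usual convention that a bit is set when the center pixel is greater than or equal to a neighbor, with neighbors read left to right and top to bottom; the function name and the toy images are hypothetical, and any preprocessing or dimensionality reduction used in the thesis is omitted.

```python
def census_transform_histogram(image):
    """256-bin histogram of 8-bit census transform codes over interior pixels."""
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            code = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # skip the center pixel itself
                    # Assumed convention: bit is 1 when center >= neighbor.
                    bit = int(image[r][c] >= image[r + dr][c + dc])
                    code = (code << 1) | bit
            hist[code] += 1
    return hist


flat = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]    # center equals all neighbors
valley = [[9, 9, 9], [9, 1, 9], [9, 9, 9]]  # center darker than all neighbors
print(census_transform_histogram(flat)[255], census_transform_histogram(valley)[0])
```

Because each code depends only on the ordering of intensities in a 3x3 neighborhood, the histogram summarizes local structure rather than absolute brightness.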
DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision