Tools for visual scene recognition
Author
Elahe Farahzadeh
Date of Issue
2014
School
School of Computer Engineering
Research Centre
Centre for Computational Intelligence
Abstract
Scene recognition is an important step towards a full understanding of an image. This
thesis presents novel ideas for capturing semantic-spatial content and for fusing local and
global features, and applies them to scene recognition. It shows how the proper use of
these approaches, without attempting to recognize objects in the scene images, can improve
recognition accuracy for scene classification.
First, we propose a method to build a semantic visual vocabulary. Features are
extracted from image patches, and the initial vocabulary is constructed by performing
k-means clustering on the extracted features and choosing the cluster centers as the visual
words. The feature vectors are quantized against the initial vocabulary to form a word-image
matrix that describes the occurrence of words in the images. The codebooks are then
embedded into a concept space by latent semantic models. We demonstrate this embedding
using Latent Semantic Analysis (LSA) as well as Probabilistic Latent Semantic Analysis
(pLSA). In the proposed space, the distances between words represent semantic distances,
which are used to construct a discriminative and semantically meaningful vocabulary. The
main contributions of the first chapter are as follows:
1. Using the semantic word space to co-cluster similar words into a semantic visual
vocabulary, which improves results over methods that use the document space directly
after pLSA embedding.
2. Investigating the effect of changing the number of latent variables.
3. Using LSA embedding, whereas all other vision systems to date use only pLSA.
This method has shown promising results on the 15-Scene categories when the proposed
model extracts a single type of visual feature.
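As a rough illustration of this pipeline, the following Python sketch builds the initial
k-means vocabulary, forms the word-image occurrence matrix, and embeds the words into a
concept space with LSA (truncated SVD), where nearby words can be co-clustered into a
smaller semantic vocabulary. The feature type, vocabulary size, number of latent
dimensions, and the agglomerative merging step are illustrative assumptions, not the
thesis's exact settings.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.decomposition import TruncatedSVD

    def build_semantic_vocabulary(descriptors, image_ids, n_images,
                                  k=400, n_topics=40, n_semantic_words=200):
        # descriptors: local patch features (e.g. SIFT), shape (n_patches, 128);
        # image_ids[i] is the index of the image that patch i came from.

        # 1. Initial vocabulary: k-means cluster centers act as visual words.
        km = KMeans(n_clusters=k, n_init=4).fit(descriptors)
        words = km.predict(descriptors)

        # 2. Word-image matrix: occurrence counts of each word in each image.
        W = np.zeros((k, n_images))
        for w, img in zip(words, image_ids):
            W[w, img] += 1

        # 3. LSA embedding: word vectors in the latent concept space, where
        #    distances approximate semantic distances between words.
        word_vecs = TruncatedSVD(n_components=n_topics).fit_transform(W)

        # 4. Co-cluster semantically close words into the final vocabulary.
        merged = AgglomerativeClustering(n_clusters=n_semantic_words).fit(word_vecs)
        return km, merged.labels_  # maps each initial word to a semantic word

An image can then be described by histogramming its patches over the merged semantic
words rather than the raw k-means words.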
Second, since fusing local and global features is beneficial for achieving promising
performance in scene categorization systems [7], we propose a novel Local-Global Feature
Fusion (LGFF) method capable of fusing latent semantic patches adaptively. The image's
local feature space is embedded into the latent semantic space through pLSA modeling;
this semantic space and the global contextual feature space are then mapped into a kernel
space. To perform this embedding, the global features and latent variables are weighted
relative to each other for each scene category. The main contributions of the second
chapter are as follows:
1. Weighting latent semantic topics based on their discriminative power.
2. Defining a novel exemplar-based distance learning method.
3. Defining a category-dependent mapping function.
The experimental evaluations indicate substantial improvements on the 15-Scene and
67-Indoor Scenes datasets.
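As a minimal sketch of this kind of fusion (the base kernels, the per-category weights,
and how they are learned are assumptions, not the thesis's exact formulation), latent
topic vectors and global features can be combined in a single kernel whose weights differ
per scene category:

    import numpy as np

    def rbf_kernel(X, Y, gamma=1.0):
        # Gaussian (RBF) similarities from squared Euclidean distances.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def lgff_kernel(Z1, Z2, G1, G2, topic_w, global_w):
        # Z1, Z2: pLSA topic vectors (n x T); G1, G2: global features such
        # as gist or CENTRIST. topic_w (length T) weights each latent topic
        # by its assumed discriminative power for the current category, and
        # global_w in [0, 1] trades off the two feature spaces.
        K_local = rbf_kernel(Z1 * topic_w, Z2 * topic_w)
        K_global = rbf_kernel(G1, G2)
        # A convex combination of valid kernels is itself a valid kernel.
        return (1.0 - global_w) * K_local + global_w * K_global

One such kernel per category could then drive, for example, a one-versus-rest SVM, so
that each category emphasizes the topics and global cues most discriminative for it.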
Third, every scene image carries rich spatial information. Capturing the spatial
positions of visual features plays an essential role in scene categorization; however,
the methods proposed in the previous chapters disregard this position information.
Inspired by methods that construct pyramid levels over image primitives [8, 9], pyramid
matching kernels are employed to measure the dissimilarity scores of image pyramids.
We improve the semantic vocabulary framework by considering the positions of semantic
local patches and the properties of their surrounding neighborhoods, using either a
global or a region-based spatial pyramid method. In the global method, after the image
is projected into the concept space, it is divided into finer sub-scenes and the pyramid
match kernels are applied over the proposed space to co-cluster the semantic visual
words. The region-based method instead divides the image into sub-scenes first and then
projects each sub-scene into the concept space. In the Local-Global Feature Fusion
framework, the global features gist and CENsus TRansform hISTogram (CENTRIST) already
capture the global spatial layout of the scene; to further improve the results, the
image is divided into sub-regions at different levels of resolution, and the pyramid
matching kernels are then applied over these sub-regions. Each sub-region is represented
either by CENTRIST or by a bag of Scale-Invariant Feature Transform (SIFT) visual
features. The experimental results outperform most of the best published results on
both the 15-Scene and 67-Indoor Scenes datasets.
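For concreteness, a compact sketch of a standard spatial pyramid match over quantized
local features follows; the three-level pyramid and the level weights follow the common
scheme of the cited pyramid methods [8, 9], while the function names and interfaces are
illustrative assumptions:

    import numpy as np

    def pyramid_match(words1, xy1, words2, xy2, vocab_size, levels=3):
        # words*: visual-word index per patch; xy*: patch coordinates
        # normalized to [0, 1). Per-cell word histograms are intersected
        # at each level, with finer levels weighted more heavily.
        def level_hist(words, xy, cells):
            h = np.zeros((cells, cells, vocab_size))
            cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
            cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
            for w, i, j in zip(words, cx, cy):
                h[i, j, w] += 1
            return h

        score = 0.0
        for l in range(levels):
            cells = 2 ** l
            inter = np.minimum(level_hist(words1, xy1, cells),
                               level_hist(words2, xy2, cells)).sum()
            # Level 0 shares the coarsest weight; finer levels double it.
            weight = 1.0 / 2 ** (levels - 1) if l == 0 else 1.0 / 2 ** (levels - l)
            score += weight * inter
        return score

The same machinery applies whether the cell representation is a word histogram,
CENTRIST, or a bag of SIFT features; a dissimilarity score can be derived from the match
score by standard kernel-to-distance conversions.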
This thesis concludes with a discussion of future directions for extending the proposed
work on scene recognition.
Subject
DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Type
Thesis