A framework for associated news story retrieval.
Date of Issue: 2013
School of Computer Engineering
Centre for Multimedia and Network Technology
Video retrieval, the task of searching for and retrieving videos relevant to a given query, is one of the most popular topics in both real-life applications and multimedia research. Finding relevant video content is important for producers of television news, documentaries and commercials. In the news domain in particular, numerous news agencies and media houses publish hundreds of news stories in many different languages every day. This volume makes efficient retrieval challenging. One specific challenge is identifying two news clips that discuss the same story. The visual information in such clips need not be similar enough for simple near-duplicate video detection algorithms to work: two news stories may differ visually and still address the same main topic. We call such stories associated news stories, and the main objective of this thesis is to identify them. Because visual similarity alone is insufficient, we also resort to other modalities such as speech and text for robust retrieval of associated news stories.
In the visual domain, associated news stories can appear as duplicate, near-duplicate or partially near-duplicate videos or, in more challenging cases, as videos sharing specific visual concepts (e.g. fire, storm, strike). We study Near-Duplicate Keyframe (NDK) identification as the core of the visual analysis, using different global and local features such as the Scale-Invariant Feature Transform (SIFT). We propose the Constraint Symmetric Matching scheme to match SIFT descriptors between two keyframes, and also incorporate other features such as color, to tackle NDK detection. Next, we cluster the keyframes within a news story that are NDKs of one another and generate a novel scene-level video signature, called a scene signature, for each NDK cluster.
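As a rough illustration of symmetric descriptor matching between two keyframes, the sketch below keeps a pair only when each descriptor is the other's nearest neighbour. It shows plain mutual nearest-neighbour matching, not the thesis's Constraint Symmetric Matching scheme, whose additional constraints are not described in this abstract. The function names and the toy 2-D descriptors are hypothetical (real SIFT descriptors are 128-dimensional).

```python
from math import dist


def nearest(query, candidates):
    """Return the index of the candidate closest to `query` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda j: dist(query, candidates[j]))


def symmetric_matches(desc_a, desc_b):
    """Return (i, j) pairs where desc_a[i] and desc_b[j] are mutual nearest neighbours."""
    if not desc_a or not desc_b:
        return []
    a_to_b = [nearest(d, desc_b) for d in desc_a]  # best match in B for each A
    b_to_a = [nearest(d, desc_a) for d in desc_b]  # best match in A for each B
    # Keep a pair only if the match holds in both directions.
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy descriptors standing in for SIFT features of two keyframes (assumed data).
frame_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
frame_b = [(0.1, 0.0), (5.0, 5.2), (20.0, 20.0)]
matches = symmetric_matches(frame_a, frame_b)
print(matches)
```

In this toy case the third descriptor of `frame_a` is closest to the second descriptor of `frame_b`, but that descriptor prefers a different partner, so the one-directional match is discarded.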
A scene signature is essentially a Bag-of-SIFT containing both the common and the distinct visual cues within an NDK cluster, and it is more compact and discriminative than the keyframe-level local feature representation. In addition to the scene signature, we generate a visual semantic signature for each news video: a 374-dimensional feature indicating the probability of the presence of predefined visual concepts in a news story. We integrate these two sources of visual knowledge (i.e. scene signature and semantic signature) to determine an enhanced visual content similarity between two stories.
In the textual domain, associated news stories usually share spoken words (from the anchor or reporter) and/or displayed words (appearing as closed captions), which can be extracted through Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR), respectively. Since OCR transcripts usually have a high error rate, we propose a novel post-processing approach based on the local dictionary idea to recover erroneous OCR output and identify the more informative words, called keywords. We generate an enhanced textual content representation from the ASR transcript and the OCR keywords through an early fusion scheme. We also employ textual semantic similarity to measure the relatedness of the textual features.
Finally, we combine the enhanced textual and visual representations through an early fusion scheme, and the corresponding similarities through a late fusion scheme, to investigate their complementary roles in associated news story retrieval. In the proposed early fusion, we retrieve visual semantics, represented by the visual semantic signature, using the textual information provided by ASR and OCR. In the late fusion, we combine the enhanced textual and visual content similarities and the early fusion similarity through a learning process to boost retrieval performance.
We evaluate the proposed NDK retrieval, detection and clustering approaches in extensive experiments on standard datasets.
We also assess, on a web video dataset, how effectively and compactly the proposed scene signature represents a video compared with other local and global video signatures. Finally, we show the usefulness of multi-modal approaches that use different textual and visual modalities to retrieve associated news stories.
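The late fusion step described in the abstract combines three similarity scores (textual, visual and early fusion) into one ranking score. A minimal sketch of that idea, assuming a simple weighted sum, is given below. The abstract does not specify the learning process or the form of the combination, so the weights, the story names and the scores here are all hypothetical placeholders.

```python
def late_fusion(scores, weights):
    """Weighted sum of per-modality similarity scores (each assumed to lie in [0, 1])."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical weights; in the thesis these would come from a learning step.
weights = {"textual": 0.5, "visual": 0.3, "early_fusion": 0.2}

# Hypothetical similarities between one query story and three candidate stories.
candidates = {
    "story_a": {"textual": 0.9, "visual": 0.2, "early_fusion": 0.5},
    "story_b": {"textual": 0.3, "visual": 0.8, "early_fusion": 0.4},
    "story_c": {"textual": 0.1, "visual": 0.1, "early_fusion": 0.1},
}

# Rank candidates by fused similarity, most similar first.
ranking = sorted(
    candidates,
    key=lambda story: late_fusion(candidates[story], weights),
    reverse=True,
)
print(ranking)
```

A weighted sum is only one possible choice; any learned combiner that maps the three similarities to a single score would fit the same description.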
DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications