Audio pattern discovery and retrieval
Date of Issue: 2012
School of Computer Engineering
Emerging Research Lab
This thesis explores unsupervised algorithms for pattern discovery and retrieval in audio and speech data. In this work, an audio pattern is defined as repeating audio content, such as repeating music segments or words/short phrases in speech recordings. The meaning of “pattern” is defined separately for each type of data: in music, repeating pattern discovery extracts segments with similar melody within a music piece; in human speech, the same words/short phrases spoken by a single speaker or by multiple speakers are defined as speech patterns; in broadcast audio, repeated commercials and logo music are considered patterns. Previous work on audio pattern discovery focuses either on symbolizing the audio signal into token sequences followed by text-based search, or on brute-force search techniques such as the self-similarity matrix and Dynamic Time Warping. A symbolization process that relies on Vector Quantization or other modelling techniques may suffer from misclassification errors, while exhaustive search incurs a high computational cost and can also be affected by channel distortion and speaker variation in the audio data. Such limitations motivate me to explore more efficient and robust approaches to automatically detect repeating information in audio data. In this thesis, different unsupervised techniques are examined to analyze music and speech separately. For music, an efficient approach that extends Ukkonen's suffix tree construction algorithm is proposed to detect repeating segments. For speech data, an iterative merging approach based on the Acoustic Segment Model (ASM) is proposed to discover recurrent phrases/words. This thesis also explores techniques for searching for audio patterns in broadcast audio, which consists of diverse content such as speech, music/songs, commercials, sound effects and background noise.
Existing audio pattern retrieval techniques focus only on specific audio types, so their applications are limited and they cannot be applied generally. In this work, a robust query-by-example framework is proposed for retrieving mixed speech and music patterns, where the ASM is examined to model music data. To verify the research, the proposed techniques are applied to public-domain audio databases such as the TIDIGITS corpus and the TRECVID database, and to a self-collected set of 30 English pop songs. The experimental results show that the proposed work achieves more robust and better performance than existing techniques.
DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition