dc.contributor.author: Dennis, Jonathan William
dc.date.accessioned: 2014-04-29T01:57:59Z
dc.date.accessioned: 2017-07-23T08:29:59Z
dc.date.available: 2014-04-29T01:57:59Z
dc.date.available: 2017-07-23T08:29:59Z
dc.date.copyright: 2014
dc.date.issued: 2014
dc.identifier.citation: Dennis, J. W. (2014). Sound event recognition in unstructured environments using spectrogram image processing. Doctoral thesis, Nanyang Technological University, Singapore.
dc.identifier.uri: http://hdl.handle.net/10356/59272
dc.description.abstract: The objective of this research is to develop feature extraction and classification techniques for the task of sound event recognition (SER) in unstructured environments. Although this field is traditionally overshadowed by the popular field of automatic speech recognition (ASR), an SER system that can achieve human-like sound recognition performance opens up a range of novel application areas. These include acoustic surveillance, bio-acoustical monitoring, environmental context detection, healthcare applications and, more generally, the rich transcription of acoustic environments. The challenges in such environments are the adverse effects such as noise, distortion and multiple sources, which are more likely to occur with distant microphones than with the close-talking microphones that are more common in ASR. In addition, the characteristics of acoustic events are less well defined than those of speech, and there is no sub-word dictionary available like the phonemes in speech. Therefore, the performance of ASR systems typically degrades dramatically in these challenging unstructured environments, and it is important to develop new methods that can perform well on this challenging task.

In this thesis, the approach taken is to interpret the sound event as a two-dimensional spectrogram image, with the two axes as the time and frequency dimensions. This enables novel methods for SER to be developed based on spectrogram image processing, inspired by techniques from the field of image processing. The motivation for such an approach is to find an automatic counterpart to "spectrogram reading", whereby humans can visually recognise the different sound event signatures in the spectrogram. The advantages of such an approach are twofold. Firstly, the sound event image representation makes it possible to naturally capture the sound information in a two-dimensional feature. This has advantages over conventional one-dimensional frame-based features, which capture only a slice of spectral information within a short time window. Secondly, the problem of detecting sound events in mixtures containing noise or overlapping sounds can be formulated in a way that is similar to image classification and object detection in the field of image processing. This makes it possible to draw on previous works in the field, while taking into account the fundamental differences between spectrograms and conventional images.

With this new perspective, three novel solutions to the challenging task of robust SER are developed in this thesis. In the first study, a method for robust sound classification called the Spectrogram Image Feature (SIF) is developed, based on a global image feature extracted directly from the time-frequency spectrogram of the sound. This in turn leads to the development of a novel sound event image representation called the Subband Power Distribution (SPD) image. This is derived as an image representation of the stochastic distribution of spectral power over the sound clip, and can overcome some of the issues of extracting image features directly from the spectrogram. In the final study, the challenging task of simultaneous recognition of overlapping sounds in noisy environments is considered. An approach is proposed that takes inspiration from object recognition in image processing, where the task of finding an object in a cluttered scene has many parallels with detecting a sound event overlapped with other sources and noise. The proposed framework combines keypoint detection and local spectrogram feature extraction with a model that captures the geometrical distribution of the keypoints over time, frequency and spectral power. For each of the proposed systems, a detailed experimental evaluation is carried out to compare its performance against a range of state-of-the-art systems.
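The two image representations named in the abstract can be illustrated with a short numpy sketch. This is not code from the thesis: the function names, window, FFT size, hop size and histogram bin count are hypothetical choices for illustration. The first function treats the short-time Fourier magnitude spectrogram as a frequency-by-time image; the second builds an SPD-style image as a per-frequency-subband histogram of normalised spectral power over the whole clip.

```python
import numpy as np

def spectrogram_image(signal, n_fft=256, hop=128):
    """Magnitude STFT as a 2D image: rows = frequency bins, columns = time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of the real-valued input
    return np.abs(np.fft.rfft(frames, axis=1)).T

def spd_image(spec, n_bins=32):
    """Subband-Power-Distribution-style image: for each frequency row, a
    histogram of its (normalised) power values over time, so the result is
    a frequency x power-bin image independent of the clip's time alignment."""
    norm = spec / (spec.max() + 1e-12)  # scale power into [0, 1]
    return np.stack([np.histogram(row, bins=n_bins, range=(0.0, 1.0))[0]
                     for row in norm])

# toy example: one second of a 440 Hz tone at an 8 kHz sampling rate
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram_image(np.sin(2 * np.pi * 440 * t))
spd = spd_image(spec)
print(spec.shape, spd.shape)
```

Note how the SPD image discards the time axis: each row summarises how often a subband reaches a given power level, which is the kind of time-invariant statistic the abstract credits with avoiding some issues of raw spectrogram features.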
dc.format.extent: 208 p.
dc.language.iso: en
dc.subject: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
dc.subject: DRNTU::Science::Mathematics::Applied mathematics::Signal processing
dc.subject: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
dc.title: Sound event recognition in unstructured environments using spectrogram image processing
dc.type: Thesis
dc.contributor.supervisor2: Tran Huy Dat
dc.contributor.school: School of Computer Engineering
dc.contributor.supervisor: Chng Eng Siong
dc.description.degree: DOCTOR OF PHILOSOPHY (SCE)
dc.identifier.doi: https://doi.org/10.32657/10356/59272
dc.contributor.organization: A*STAR Institute for Infocomm Research


Files in this item

phd_thesis_NTU_print.pdf  3.004Mb  application/pdf
