Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/157118
Full metadata record
DC Field | Value | Language
dc.contributor.author | Nguyen, Thi Ngoc Tho | en_US
dc.contributor.author | Watcharasupat, Karn N. | en_US
dc.contributor.author | Nguyen, Ngoc Khanh | en_US
dc.contributor.author | Jones, Douglas L. | en_US
dc.contributor.author | Gan, Woon-Seng | en_US
dc.date.accessioned | 2022-06-06T01:35:38Z | -
dc.date.available | 2022-06-06T01:35:38Z | -
dc.date.issued | 2022 | -
dc.identifier.citation | Nguyen, T. N. T., Watcharasupat, K. N., Nguyen, N. K., Jones, D. L. & Gan, W. (2022). SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1749-1762. https://dx.doi.org/10.1109/TASLP.2022.3173054 | en_US
dc.identifier.issn | 2329-9290 | en_US
dc.identifier.uri | https://hdl.handle.net/10356/157118 | -
dc.description.abstract | Sound event localization and detection (SELD) consists of two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable to different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone arrays (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6% each, compared to multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased the F1 score and localization recall by 16% and 7%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra. | en_US
dc.description.sponsorship | Ministry of Education (MOE) | en_US
dc.description.sponsorship | Nanyang Technological University | en_US
dc.language.iso | en | en_US
dc.relation | MOE2017-T2-2-060 | en_US
dc.relation | GCP205559654 | en_US
dc.relation.ispartof | IEEE/ACM Transactions on Audio, Speech, and Language Processing | en_US
dc.rights | © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/TASLP.2022.3173054. | en_US
dc.subject | Engineering::Electrical and electronic engineering | en_US
dc.title | SALSA: spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection | en_US
dc.type | Journal Article | en
dc.contributor.school | School of Electrical and Electronic Engineering | en_US
dc.contributor.research | Centre for Infocomm Technology (INFINITUS) | en_US
dc.identifier.doi | 10.1109/TASLP.2022.3173054 | -
dc.description.version | Submitted/Accepted version | en_US
dc.identifier.volume | 30 | en_US
dc.identifier.spage | 1749 | en_US
dc.identifier.epage | 1762 | en_US
dc.subject.keywords | Deep Learning | en_US
dc.subject.keywords | Microphone Array | en_US
dc.subject.keywords | Feature Extraction | en_US
dc.subject.keywords | Sound Event Localization and Detection | en_US
dc.subject.keywords | Spatial Cues | en_US
dc.description.acknowledgement | This work was supported in part by the Singapore Ministry of Education Academic Research Fund Tier-2, under Research Grant MOE2017-T2-2-060, and in part by the Google Cloud Research Credits Program under Award GCP205559654. K. N. Watcharasupat further acknowledges the support from the CN Yang Scholars Programme, Nanyang Technological University, Singapore. | en_US
item.grantfulltext | open | -
item.fulltext | With Fulltext | -
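
The abstract above describes the SALSA feature as multichannel log-spectrograms stacked with per-bin spatial cues taken from the normalized principal eigenvector of the spatial covariance matrix. The Python sketch below illustrates that idea only and is not the authors' reference implementation: the function name salsa_like_features, the single-bin covariance estimate (the paper estimates the covariance over a local time-frequency region), and the phase-only normalization (a rough stand-in for the MIC-format variant; the FOA variant is normalized differently) are assumptions made for illustration.

import numpy as np

def salsa_like_features(stft: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Stack multichannel log-power spectrograms with spatial cues derived from
    the principal eigenvector of a per-bin spatial covariance estimate.
    Illustrative sketch only; `stft` is complex with shape (n_ch, n_frames, n_freq)."""
    n_ch, n_frames, n_freq = stft.shape

    # Multichannel log-power spectrograms on the same time-frequency grid as the
    # cues below, giving an exact TF mapping between signal power and directional cues.
    log_spec = np.log(np.abs(stft) ** 2 + eps)

    # Spatial cue channels (one per non-reference microphone).
    cues = np.zeros((n_ch - 1, n_frames, n_freq))
    for t in range(n_frames):
        for f in range(n_freq):
            x = stft[:, t, f]                # channel vector at TF bin (t, f)
            cov = np.outer(x, x.conj())      # single-bin spatial covariance estimate (assumption)
            _, vecs = np.linalg.eigh(cov)    # Hermitian eigendecomposition
            pev = vecs[:, -1]                # principal eigenvector
            # Normalize by the reference channel so the remaining entries carry
            # inter-channel differences; keeping only phase differences here loosely
            # mimics the MIC-format cues described in the abstract.
            pev = pev / (pev[0] + eps)
            cues[:, t, f] = np.angle(pev[1:])

    # SALSA-like feature: log-spectrograms stacked with the spatial cue channels.
    return np.concatenate([log_spec, cues], axis=0)

In the paper these features are fed to a SELD network for joint detection and localization; the sketch stops at feature extraction.
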
Appears in Collections: EEE Journal Articles

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.