Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/88357
Full metadata record
DC Field | Value | Language
dc.contributor.author | Sivanagaraja, Tatinati | en
dc.contributor.author | Ho, Mun Kit | en
dc.contributor.author | Khong, Andy Wai Hoong | en
dc.contributor.author | Wang, Yubo | en
dc.date.accessioned | 2018-04-25T06:35:53Z | en
dc.date.accessioned | 2019-12-06T17:01:26Z | -
dc.date.available | 2018-04-25T06:35:53Z | en
dc.date.available | 2019-12-06T17:01:26Z | -
dc.date.copyright | 2018-01-01 | en
dc.date.issued | 2017 | en
dc.identifier.citation | Sivanagaraja, T., Ho, M. K., Khong, A. W. H., & Wang, Y. (2017). End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks. Paper presented at 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia (pp. 189-192). | en
dc.identifier.uri | https://hdl.handle.net/10356/88357 | -
dc.description.abstract | Automatic speech emotion recognition is one of the challenging tasks in the machine learning community, mainly due to the significant variations across individuals expressing the same emotion cue. The success of emotion recognition with machine learning techniques depends primarily on the feature set chosen for learning. Formulating features that cater for all variations in emotion cues, however, is not a trivial task. Recent work on emotion recognition with deep learning techniques therefore focuses on end-to-end learning schemes that identify features directly from the raw speech signal instead of relying on a hand-crafted feature set. Existing methods in this scheme, however, do not account for the fact that speech signals often exhibit distinct features at different time scales and frequencies that are not apparent in the raw form. We propose the multi-scale convolution neural network (MCNN) to identify features at different time scales and frequencies from raw speech signals. This end-to-end model leverages a multi-branch input layer and tunable convolution layers to learn the identified features, which are subsequently employed to recognize the emotion cues. As a proof of concept, the MCNN method with a fixed transformation stage is evaluated on the SAVEE emotion database. Results show that MCNN improves emotion recognition performance compared with existing methods, which underpins the necessity of learning features at different time scales. | en
dc.description.sponsorship | NRF (Natl Research Foundation, S’pore) | en
dc.format.extent | 4 p. | en
dc.language.iso | en | en
dc.rights | © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: [http://dx.doi.org/10.1109/APSIPA.2017.8282026]. | en
dc.subject | Machine Learning | en
dc.subject | Emotion Recognition | en
dc.title | End-to-End Speech Emotion Recognition Using Multi-Scale Convolution Networks | en
dc.type | Conference Paper | en
dc.contributor.school | School of Electrical and Electronic Engineering | en
dc.contributor.conference | 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) | en
dc.identifier.doi | 10.1109/APSIPA.2017.8282026 | en
dc.description.version | Accepted version | en
dc.identifier.rims | 204038 | en
item.fulltext | With Fulltext | -
item.grantfulltext | open | -
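
The abstract above describes an end-to-end, multi-branch architecture that learns emotion features from raw speech at several time scales. As a rough illustration of that multi-scale idea, the sketch below builds a multi-branch 1-D CNN over raw waveforms in PyTorch. The kernel widths, channel counts, pooling sizes, and the seven-class output (matching the SAVEE emotion categories) are assumptions for illustration only, not the authors' actual MCNN configuration; the paper's fixed transformation stage is likewise omitted.

```python
# Illustrative sketch only: a multi-branch 1-D CNN over raw speech,
# loosely following the multi-scale idea in the abstract. Layer sizes,
# kernel widths, and pooling lengths are assumptions, not the paper's
# actual MCNN configuration.
import torch
import torch.nn as nn

class MultiScaleCNN(nn.Module):
    def __init__(self, n_classes: int = 7, branch_kernels=(8, 32, 128)):
        super().__init__()
        # One branch per time scale: wider kernels summarize slower
        # variations in the waveform, narrower kernels faster ones.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=k, stride=k // 2, padding=k // 2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(64),  # fixed-length summary per branch
            )
            for k in branch_kernels
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(len(branch_kernels) * 16 * 64, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) raw speech, no hand-crafted features
        features = torch.cat([b(waveform) for b in self.branches], dim=1)
        return self.classifier(features)

model = MultiScaleCNN()
logits = model(torch.randn(2, 1, 16000))  # two 1-second clips at 16 kHz
print(logits.shape)  # torch.Size([2, 7])
```

The differing kernel widths are the point of the exercise: each branch responds to structure at a different temporal resolution before the concatenated features feed a shared classifier, mirroring the abstract's argument that emotion cues appear at different time scales.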
Appears in Collections: EEE Conference Papers
Files in This Item:
File | Description | Size | Format
Manuscript_APSIPA2017.pdf | | 441.76 kB | Adobe PDF
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.