Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/137336
Title: | Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM |
Authors: | Xu, Chenglin; Rao, Wei; Xiao, Xiong; Chng, Eng Siong; Li, Haizhou |
Keywords: | Engineering::Computer science and engineering |
Issue Date: | 2018 |
Source: | Xu, C., Rao, W., Xiao, X., Chng, E. S., & Li, H. (2018). Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM. Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6-10. doi:10.1109/icassp.2018.8462471 |
Conference: | 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Abstract: | The utterance-level permutation invariant training (uPIT) technique is a state-of-the-art deep learning approach for speaker-independent multi-talker separation. uPIT solves the label ambiguity problem by minimizing the mean square error (MSE) over all permutations between outputs and targets. However, uPIT may be sub-optimal at the segmental level because the optimization is not computed over individual frames. In this paper, we propose a constrained uPIT (cuPIT) to solve this problem by computing a weighted MSE loss using dynamic information (i.e., delta and acceleration). The weighted loss ensures the temporal continuity of output frames belonging to the same speaker. Inspired by the heuristics (i.e., vocal tract continuity) used in computational auditory scene analysis, we then extend the model by adding a Grid LSTM layer, yielding a model we name cuPIT-Grid LSTM, which automatically learns both temporal and spectral patterns of the input magnitude spectrum simultaneously. Experimental results on the WSJ0-2mix dataset show relative improvements of 9.6% and 8.5% over the uPIT baseline under the closed and open conditions, respectively. |
URI: | https://hdl.handle.net/10356/137336 |
ISBN: | 9781538646588 |
DOI: | 10.1109/ICASSP.2018.8462471 |
Schools: | School of Computer Science and Engineering |
Research Centres: | Temasek Laboratories |
Rights: | © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/ICASSP.2018.8462471 |
Fulltext Permission: | open |
Fulltext Availability: | With Fulltext |
Appears in Collections: | SCSE Conference Papers |
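The abstract describes the training objectives only in words. The following Python sketch illustrates the general idea of utterance-level PIT with an added dynamic-feature penalty. It rests on several assumptions that do not come from the paper: each spectrogram is a plain list of frames (each frame a list of magnitudes); estimated magnitudes are compared directly with target magnitudes rather than through masked mixtures; delta and acceleration are approximated by first- and second-order frame differences instead of a regression window; and the weights `w_delta` and `w_accel` are hypothetical parameters, used both to score each permutation and as the loss value. With both weights set to zero, the function reduces to the uPIT criterion, in which a single speaker assignment is chosen for the whole utterance by minimizing MSE over all permutations.

```python
from itertools import permutations

# Assumption: a spectrogram is a list of frames, each a list of magnitudes
# (frames x frequency bins). All names here are illustrative, not from the paper.


def mse(a, b):
    """Mean squared error between two equal-shaped spectrograms."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def delta(spec):
    """First-order frame difference, a simplified stand-in for delta features."""
    return [[c - p for c, p in zip(cur, prev)] for prev, cur in zip(spec, spec[1:])]


def weighted_mse(est, ref, w_delta, w_accel):
    """Static MSE plus hypothetical weighted MSE terms on delta and acceleration."""
    loss = mse(est, ref)
    d_est, d_ref = delta(est), delta(ref)
    if w_delta and d_est:
        loss += w_delta * mse(d_est, d_ref)
    # Acceleration is approximated as the delta of the delta.
    a_est, a_ref = delta(d_est), delta(d_ref)
    if w_accel and a_est:
        loss += w_accel * mse(a_est, a_ref)
    return loss


def pit_loss(estimates, targets, w_delta=0.0, w_accel=0.0):
    """Pick one output-to-target assignment for the whole utterance.

    Returns (loss, perm) where estimates[i] is paired with targets[perm[i]].
    With w_delta == w_accel == 0 this is a plain utterance-level PIT MSE.
    """
    best_loss, best_perm = None, None
    for perm in permutations(range(len(targets))):
        loss = sum(
            weighted_mse(estimates[i], targets[j], w_delta, w_accel)
            for i, j in enumerate(perm)
        ) / len(targets)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

For two outputs that are swapped relative to the targets, `pit_loss` returns the permutation `(1, 0)` with zero loss. The dynamic terms penalize frame-to-frame jitter that static MSE ignores: against an all-zero target, a constant offset of 1 and an alternating ±1 signal both have a static MSE of 1.0, but with `w_delta=1.0` their weighted losses become 1.0 and 5.0, respectively.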
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
SINGLE CHANNEL SPEECH SEPARATION WITH CONSTRAINED UTTERANCE LEVEL PERMUTATION INVARIANT TRAINING USING GRID LSTM.pdf | | 473.12 kB | Adobe PDF |
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.