Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/178275
Title: Deep learning based speech enhancement for noise adverse environment
Authors: Tan, Zhi Wei
Keywords: Computer and Information Science; Engineering
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Tan, Z. W. (2024). Deep learning based speech enhancement for noise adverse environment. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178275
Project: MRP14; IAF-ICP (No. I2201E0013)
Abstract: Speech is the primary way humans communicate. Speech enhancement algorithms estimate speech from received signals. Although conventional approaches can achieve accurate estimates under low-noise conditions, their performance degrades as the signal-to-noise ratio (SNR) decreases. This thesis introduces four novel deep learning (DL) methods for low-SNR scenarios. The first work proposes a convolutional neural network (CNN) that estimates the speech signal from the received signal by learning the features of noise and speech. In contrast to existing single-channel deep neural networks (DNNs), the proposed small model on low SNR (SMoLnet) better exploits high-resolution frequency features while being parameter-efficient. High frequency resolution is effective at low SNR since it exposes more frequency bins with higher SNR. However, the filters in the convolution layers of a CNN extract local input features and therefore cover only a limited receptive field of the high-resolution received signal. Although increasing the filter length can increase the number of local features extracted and the receptive field, it also increases the number of parameters employed by the neural network. To overcome this issue, the convolution filters are exponentially dilated so that the receptive field after each layer is double that of the layer before. By doing so, the final layer can leverage a large receptive field that encompasses the large number of frequency bins provided by high-resolution frequency features. The second work proposes a transfer learning framework that leverages pre-trained single-channel neural networks to improve the training of multichannel neural networks for low-SNR scenarios with scarce training data. The framework consists of a newly formulated multichannel DNN based on the U-net architecture with exponentially dilated layers and a pre-trained single-channel neural network.
The multichannel DNN leverages the spectro-spatial features of a high-resolution frequency input to produce an enhanced feature for the subsequent pre-trained single-channel DNN. Since the spatial information in multichannel data depends on the source locations and the sensor array configuration (such as the number of sensors, sensor spacing, sensor arrangement, and sensor mismatch), and publicly available multichannel datasets are scarce compared to single-channel ones, a U-net-like architecture is employed for faster convergence. In doing so, the proposed architecture can achieve good performance on a publicly available multichannel dataset while using only 10% of the training data. The third work proposes a multichannel speech enhancement framework based on time-varying neural beamformers and a multichannel DNN under low SNR. In contrast to existing approaches that combine DL with beamforming, this approach requires neither prior knowledge of the source direction nor a large number of estimated frames. The proposed recurrent neural beamformer (R-NBF) achieves multichannel speech enhancement with the speech sample spatial covariance matrix (SCM) through a feedback connection. An analysis framework based on a first-order Taylor approximation with Wirtinger calculus is also presented. The proposed R-NBF architecture was validated using a real recorded signal from a hexacopter hovering away from a speaker in an open field. Despite such adverse noise conditions, it achieves significantly improved speech intelligibility and reduced background noise. The fourth work proposes a gridless direction-of-arrival (DOA) estimation method using DL for narrowband signals under low-SNR scenarios with a practical array and a limited number of snapshots. More specifically, a complex CNN with a newly formulated complex phasor normalization is proposed. The proposed approach demonstrated robustness to unseen array imperfections by learning localized phase-to-sensor relationships from the complex feature maps for SNR as low as −5 dB.
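Illustrative note on the dilation scheme described for the first work: the short sketch below computes how the receptive field (in frequency bins) grows across stacked stride-1 convolution layers, with and without exponential dilation. It is not taken from the thesis; the function name `receptive_fields`, the kernel size of 2, and the layer count of 8 are assumptions chosen only to make the doubling behaviour visible.

```python
def receptive_fields(kernel_size, num_layers, exponential=True):
    """Return the receptive field after each stacked 1-D conv layer (stride 1).

    Hypothetical helper for illustration only; kernel_size and num_layers
    are assumed values, not settings reported in the thesis.
    """
    fields = []
    rf = 1  # a single input bin before any convolution
    for layer in range(num_layers):
        # Exponential dilation: 1, 2, 4, 8, ... ; otherwise a plain convolution.
        dilation = 2 ** layer if exponential else 1
        # Each layer widens the receptive field by (kernel_size - 1) * dilation.
        rf += (kernel_size - 1) * dilation
        fields.append(rf)
    return fields


if __name__ == "__main__":
    dilated = receptive_fields(kernel_size=2, num_layers=8)
    plain = receptive_fields(kernel_size=2, num_layers=8, exponential=False)
    print("dilated:", dilated)
    print("plain:  ", plain)
```

Under these assumptions both stacks use the same number of filter taps (kernel size times layer count), yet the dilated stack's receptive field doubles at every layer while the plain stack's grows by one bin per layer. With larger kernels the growth is close to, but not exactly, a doubling.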
URI: https://hdl.handle.net/10356/178275
Schools: School of Electrical and Electronic Engineering 
Research Centres: ST Engineering-NTU Corporate Lab; Delta-NTU Corporate Laboratory
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: embargo_20250610
Fulltext Availability: With Fulltext
Appears in Collections: EEE Theses

Files in This Item:
File: thesis_final.pdf
Description: Until 2025-06-10
Size: 10.02 MB
Format: Adobe PDF
Under embargo until Jun 10, 2025

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.