Please use this identifier to cite or link to this item:
Full metadata record
DC Field | Value | Language
dc.contributor.author | Tan, Zhi Wei | en_US
dc.identifier.citation | Tan, Z. W. (2024). Deep learning based speech enhancement for noise adverse environment. Doctoral thesis, Nanyang Technological University, Singapore. |
dc.description.abstract | Speech is the primary way humans communicate. Speech enhancement algorithms estimate speech from received signals. Although conventional approaches can achieve accurate estimates under low noise conditions, their performance degrades as the signal-to-noise ratio (SNR) decreases. This thesis introduces four novel deep learning (DL) methods for low-SNR scenarios. The first work proposes a convolutional neural network (CNN) that estimates speech signals from the received signal by learning the features of noise and speech. In contrast to existing single-channel deep neural networks (DNNs), the proposed small model on low SNR (SMoLnet) better exploits higher-resolution frequency signals while being parameter efficient. High-resolution frequency is effective at low SNR since it exposes more frequency bins with higher SNR. However, the filters in the convolution layers of a CNN extract local input features and have a limited receptive field over the high-resolution frequency representation of the received signal. Although lengthening the convolutional filters increases both the number of local features extracted and the receptive field, it also increases the number of parameters in the neural network. To overcome this issue, the convolution filters are exponentially dilated so that the receptive field after each layer is double that of the layer before. By doing so, the final layer can leverage a large receptive field that covers the large number of frequency bins provided by high-resolution frequency features. The second work proposes a transfer learning framework that leverages pre-trained single-channel neural networks to improve the training of multichannel neural networks for low-SNR scenarios with scarce training data. The framework consists of a newly formulated multichannel DNN, based on the U-net architecture with exponentially dilated layers, and a pre-trained single-channel neural network.
The multichannel DNN leverages the spectro-spatial features of a high-resolution frequency input to produce an enhanced feature for the subsequent pre-trained single-channel DNN. The spatial information in multichannel data depends on the source locations and on the sensor array configuration (such as the number of sensors, sensor spacing, sensor arrangement, and sensor mismatch), and publicly available multichannel datasets are scarce compared to single-channel ones, so a U-net-like architecture is employed for faster convergence. In doing so, the proposed architecture can achieve good performance on a publicly available multichannel dataset while using only 10% of the training data. The third work proposes a multichannel speech enhancement framework based on time-varying neural beamformers and a multichannel DNN under low SNR. In contrast to existing approaches that combine DL with beamforming, this approach requires neither prior knowledge of the source direction nor a large number of estimated frames. The proposed recurrent neural beamformer (R-NBF) achieves multichannel speech enhancement with the speech sample spatial covariance matrix (SCM) through a feedback connection. An analysis framework based on Taylor’s first-order approximation with Wirtinger calculus is also presented. The proposed R-NBF architecture was validated using a real recorded signal from a hexacopter hovering away from a speaker in an open field. Despite such adverse noise conditions, it achieves significantly improved speech intelligibility and reduced background noise. The fourth work proposes a gridless direction-of-arrival (DOA) method using DL for narrow-band signals under low-SNR scenarios with a practical array and a limited number of snapshots. More specifically, a complex CNN with a newly formulated complex phasor normalization is proposed.
The proposed approach demonstrated robustness to unseen array imperfections by learning localized phase-to-sensor relationships from the complex feature maps for SNRs as low as −5 dB. | en_US
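The receptive-field argument made for the first work can be illustrated with a short calculation. The sketch below is not the SMoLnet architecture; it only compares a stack of stride-1 one-dimensional convolutions with exponentially growing dilation against the same stack without dilation. The kernel size of 3, the depth of 8 layers, and all function names are assumptions chosen for illustration.

```python
import math


def receptive_field(kernel_size, dilations):
    """Receptive field, in input bins, of stacked stride-1 1-D convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation bins.
        rf += (kernel_size - 1) * d
    return rf


def taps_per_channel_pair(kernel_size, num_layers):
    """Filter taps per input/output channel pair, summed over layers (no bias)."""
    return kernel_size * num_layers


def kernel_size_to_match(target_rf, num_layers):
    """Smallest kernel size an undilated stack needs to reach target_rf."""
    return 1 + math.ceil((target_rf - 1) / num_layers)


NUM_LAYERS = 8    # assumed depth, for illustration only
KERNEL_SIZE = 3   # assumed kernel size, for illustration only

exp_dilations = [2 ** i for i in range(NUM_LAYERS)]   # 1, 2, 4, ..., 128
unit_dilations = [1] * NUM_LAYERS

# Receptive field after each layer: dilated stack vs. undilated stack.
for layer in range(1, NUM_LAYERS + 1):
    print(
        layer,
        receptive_field(KERNEL_SIZE, exp_dilations[:layer]),
        receptive_field(KERNEL_SIZE, unit_dilations[:layer]),
    )

dilated_rf = receptive_field(KERNEL_SIZE, exp_dilations)
wide_kernel = kernel_size_to_match(dilated_rf, NUM_LAYERS)
print("dilated taps:", taps_per_channel_pair(KERNEL_SIZE, NUM_LAYERS))
print("undilated taps for same field:", taps_per_channel_pair(wide_kernel, NUM_LAYERS))
```

With these assumed values the dilated stack covers 511 frequency bins using 24 taps per channel pair, and its receptive field roughly doubles at every layer (3, 7, 15, ...). The undilated stack covers only 17 bins and would need a kernel size of 65 (520 taps per channel pair) to reach the same span.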
dc.publisher | Nanyang Technological University | en_US
dc.relation | IAF-ICP (No. I2201E0013) | en_US
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). | en_US
dc.subject | Computer and Information Science | en_US
dc.title | Deep learning based speech enhancement for noise adverse environment | en_US
dc.type | Thesis-Doctor of Philosophy | en_US
dc.contributor.supervisor | Andy Khong W H | en_US
dc.contributor.school | School of Electrical and Electronic Engineering | en_US
dc.description.degree | Doctor of Philosophy | en_US
dc.contributor.research | ST Engineering-NTU Corporate Lab | en_US
dc.contributor.research | Delta-NTU Corporate Laboratory | en_US
dc.subject.keywords | Speech enhancement | en_US
dc.subject.keywords | Neural beamforming | en_US
dc.subject.keywords | Deep learning | en_US
dc.subject.keywords | Low signal-to-noise ratio | en_US
dc.subject.keywords | Transfer learning | en_US
dc.subject.keywords | Scarce training data | en_US
dc.subject.keywords | Array signal processing | en_US
item.fulltext | With Fulltext | -
Appears in Collections: EEE Theses
Files in This Item:
File | Description | Size | Format
 | Until 2025-06-10 | 10.02 MB | Adobe PDF | Under embargo until Jun 10, 2025

Updated on Jul 17, 2024

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.