Speech recognition based on front-end noise removal algorithms
Date of Issue: 2014
School of Electrical and Electronic Engineering
One of the biggest obstacles hindering the widespread use of automatic speech recognition (ASR) technology is its inability to handle noise, including environmental noise, channel distortion, and speaker variability. To address this, we propose several feature compensation approaches to improve the robustness of ASR systems. The human auditory system works properly even in very adverse environments, e.g. a crowded shopping mall where thousands of people are talking loudly over background commercial broadcasts. Analyzing and modeling the human auditory system is therefore a straightforward and logical way to improve the performance of ASR systems. The first part of this thesis focuses on the study of masking effects, which describe how an otherwise clearly audible sound (the maskee) becomes less audible because of the presence of another sound (the masker). Masking effects can be classified into temporal masking and frequency masking (a.k.a. simultaneous masking). In temporal masking, the masker (the stronger sound that diminishes others) and the maskee (the weaker sound made less audible) share the same frequency but start at different times; in frequency masking, they start at the same time but differ in frequency. Chapter 3 introduces a novel Mel-Frequency Cepstral Coefficients (MFCC) based algorithm that simulates these properties of the human auditory system. It sequentially implements temporal masking and frequency masking in the time domain and frequency domain, respectively. Temporal spectral averaging (TSA) and cepstral mean and variance normalization (CMVN) are used for post-processing. Experimental results show that the proposed algorithm achieves very promising results. In the second study, we further investigate the special properties of the time-frequency domain and propose the 2D psychoacoustic filter in Chapter 4.
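As an illustration of the CMVN post-processing step mentioned above, the following is a minimal sketch: each cepstral dimension is normalized to zero mean and unit variance across the utterance. The function name and the epsilon guard are illustrative choices, not the thesis's exact implementation.

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization (CMVN).

    Normalizes each cepstral dimension to zero mean and unit variance
    across the utterance, reducing channel and level mismatch.
    features: (num_frames, num_coeffs) MFCC feature matrix.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Small epsilon avoids division by zero on constant dimensions.
    return (features - mean) / (std + 1e-10)

# Example on a synthetic "MFCC" matrix: 100 frames x 13 coefficients,
# deliberately offset and scaled to mimic channel effects.
feats = np.random.randn(100, 13) * 5.0 + 2.0
norm = cmvn(feats)
```

After normalization, every coefficient track has (approximately) zero mean and unit variance, which is what makes the features less sensitive to a fixed channel distortion.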
In the time-frequency domain, the speech signal is represented over both time and frequency, which allows us to address another psychoacoustic phenomenon, temporal frequency masking, in which the masker and maskee differ in both frequency and commencing time. The 2D psychoacoustic filter implements not only temporal masking and frequency masking but also temporal frequency masking and temporal integration. All the filters proposed in Chapter 4 are high-pass filters, so they are named 2D psychoacoustic H-filters. A mathematical derivation based on the characteristic functions of masking effects shows the correctness of the 2D psychoacoustic filter. The proposed method sharpens the spectrum of the signal in both the frequency and time domains. We then further extend the design of psychoacoustic filters. Psychoacoustic models are usually implemented in a subtractive manner (the estimated amount of masking is subtracted from the noisy speech) with the intention of removing noise. However, this is not the only possible form of implementation. This thesis presents a novel algorithm that implements psychoacoustic models in an additive way. The algorithm is motivated by the fact that weak sound elements below the masking threshold are equivalent to the human auditory system regardless of their actual sound pressure level, since all of them are perceived as silence. All the filters proposed in this part are low-pass, so they are named 2D psychoacoustic 1-filters. A detailed theoretical analysis demonstrates the noise removal ability of the 2D psychoacoustic 1-filters. Experimental results show significant improvements from the proposed algorithm over the baseline MFCC system in various noisy conditions.
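Mechanically, a 2D psychoacoustic filter of the kind described above amounts to a 2D convolution applied to the log spectrogram, with one axis being time (frames) and the other frequency (channels). The sketch below uses a generic Laplacian-style sharpening kernel purely for illustration; the thesis's actual H-filter coefficients are derived from the characteristic functions of the masking effects and are not reproduced here.

```python
import numpy as np

def conv2d_same(spec, kernel):
    """Naive 'same'-size 2D convolution with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(spec, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]  # flip for true convolution
    out = np.zeros_like(spec)
    for i in range(spec.shape[0]):
        for j in range(spec.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

# Illustrative high-pass kernel: sharpens the log spectrogram along
# both the time axis (rows) and the frequency axis (columns).
# These coefficients are a generic sharpener, NOT the thesis's
# masking-derived H-filter values.
h_kernel = np.array([[ 0.0, -0.1,  0.0],
                     [-0.1,  1.4, -0.1],
                     [ 0.0, -0.1,  0.0]])

# Synthetic log power spectrogram: 40 frames x 64 frequency channels.
log_spec = np.log(np.abs(np.random.randn(40, 64)) + 1e-6)
sharpened = conv2d_same(log_spec, h_kernel)
```

Because the kernel's coefficients sum to 1, flat (already smooth) regions of the spectrogram pass through unchanged while local peaks are emphasized, which is the sharpening behavior the H-filters aim for.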
The degradation of ASR performance is mainly due to the mismatch between the statistical model trained on clean speech and the test features derived from noisy speech. To reduce this mismatch, we propose to recover the clean speech from the noisy speech. Two front-end noise removal algorithms are presented: Smoothing and Noise Subtraction (SNS) and Newton & Log Power Subtraction (NLPS). SNS tries to recover the temporal structure of the speech power spectrum. The histogram of the average speech log power spectrum shows that noise contamination shifts the noise peak, which in turn degrades the performance of speech recognition systems. A two-step scheme is proposed to weaken the noise effects by first reducing the noise variance and then shifting the noise mean. NLPS, in contrast, works by solving the nonlinear function derived from the MFCC feature extraction algorithm. Finally, a novel noise removal algorithm based on the double transform is proposed. Time-frequency domain noise removal algorithms directly process the speech spectrogram, which can be treated as an image of the speech signal. Following traditional image processing techniques, a 2D Fourier transform is performed, and a 2D filter is applied to remove noise in the so-called double transform domain. A theoretical analysis shows the effectiveness of the proposed algorithm. Extensive comparisons are made against state-of-the-art robust speech recognition algorithms, namely Lateral Inhibition (LI), Forward Masking (FM), Cepstral Mean and Variance Normalization (CMVN), the RelAtive SpecTrAl (RASTA) filter, the ETSI standard advanced front-end feature extraction algorithm (AFE), Stereo-based Piecewise Linear Compensation for Environments (SPLICE), and Mean Variance Normalization & ARMA filtering (MVA). Experimental results show that significant improvements can be obtained from the proposed algorithms.
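The double transform idea described above can be sketched in a few lines: treat the spectrogram as an image, take its 2D FFT, attenuate components in the transform domain, and invert. The sketch below uses a simple ideal low-pass mask with an assumed `keep_frac` parameter as a stand-in for the thesis's 2D filter, whose actual design may differ.

```python
import numpy as np

def double_transform_denoise(spectrogram, keep_frac=0.25):
    """Sketch of double-transform noise removal.

    Treats the spectrogram as an image, applies a 2D FFT, keeps only
    the low spatial-frequency components (an ideal low-pass mask,
    used here only for illustration), and inverts the transform.
    """
    rows, cols = spectrogram.shape
    F = np.fft.fftshift(np.fft.fft2(spectrogram))  # DC moved to center
    cy, cx = rows // 2, cols // 2
    ry = int(rows * keep_frac / 2)
    rx = int(cols * keep_frac / 2)
    mask = np.zeros((rows, cols))
    mask[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = 1.0
    filtered = np.fft.ifftshift(F * mask)
    return np.real(np.fft.ifft2(filtered))

# Example: a synthetic 16 x 16 "spectrogram" image.
spec = np.abs(np.random.randn(16, 16))
smoothed = double_transform_denoise(spec)
```

The rationale, as in classical image denoising, is that the coherent speech structure concentrates in the low spatial frequencies of the spectrogram image, while broadband noise spreads across the whole double-transform plane.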
DRNTU::Engineering::Electrical and electronic engineering