Front-end noise reduction algorithms for automatic speech recognition
Date of Issue: 2014
School of Electrical and Electronic Engineering
One of the biggest obstacles to the widespread use of automatic speech recognition (ASR) technology is its inability to handle noise, which here includes environmental noise, channel distortion, and speaker variability. To address this, we propose several feature compensation approaches to improve the robustness of ASR systems: 1) direct implementation of the masking effect; 2) a 2D psychoacoustic filter; 3) model-based noise reduction. The first two are based on psychoacoustics, and the last comprises several algorithms based on a novel feature model. More details are given below.
The human auditory system works well in adverse environments, e.g. in a crowded shopping mall where thousands of people are talking loudly over background commercial broadcasts. Modeling the human auditory system is therefore a natural approach to improving the performance of ASR systems. The first part of this thesis studies masking effects, which describe how a clearly audible sound (the maskee) becomes less audible in the presence of another sound (the masker). Masking effects can be classified as temporal masking and frequency masking (a.k.a. simultaneous masking). Chapter 3 introduces a novel Mel-Frequency Cepstral Coefficients (MFCC) based algorithm that simulates these properties of the human auditory system. It implements temporal masking in the time domain and then frequency masking in the frequency domain.
For the second contribution on psychoacoustics, we further investigate the properties of the time-frequency domain and propose the 2D psychoacoustic filter. In the time-frequency domain the speech signal is represented over both time and frequency, which makes it possible to address another psychoacoustic phenomenon, temporal frequency masking. Temporal frequency masking describes the situation where the masker and maskee differ in both frequency and onset time. The 2D psychoacoustic filter implements not only temporal masking and frequency masking, but also temporal frequency masking and temporal integration. We also propose a unified model for the 2D psychoacoustic filter, which effectively models the equivalent masking phenomena. Mathematical derivations based on the characteristic functions of masking effects are provided to show the correctness of the 2D psychoacoustic filter.
The degradation of ASR performance is mainly due to the mismatch between the statistical model trained on clean speech and the test features derived from noisy speech. To reduce this mismatch, we propose to recover the clean speech from the noisy speech. Two front-end noise reduction algorithms are presented: Smoothing & Noise Subtraction (SNS) and Newton & Log Power Subtraction (NLPS). SNS tries to recover the temporal structure of the speech power spectrum. The histogram of the average speech log power spectrum shows that noise contamination leads to a shift of the noise peak. A two-step scheme is therefore proposed to remove noise by first reducing the noise variance and then shifting the noise mean. NLPS works by solving a nonlinear equation derived from the MFCC feature extraction algorithm.
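The two-step idea described for SNS (reduce the noise variance, then shift the noise mean) can be illustrated with a minimal sketch. This is not the SNS algorithm from the thesis, whose details the abstract does not give. The function name `smooth_and_subtract`, the moving-average smoother, the assumption that the first few frames contain only noise, and the spectral floor are all hypothetical choices made for illustration.

```python
import numpy as np

def smooth_and_subtract(log_power, noise_frames=10, win=5, floor_factor=0.1):
    """Illustrative two-step cleanup of a log power spectrogram.

    log_power: array of shape (frames, bins) holding natural-log power.
    All parameters are assumptions for this sketch, not values from the thesis.
    """
    log_power = np.asarray(log_power, dtype=float)
    if win % 2 == 0:
        raise ValueError("win must be odd so the output keeps the input length")

    # Step 1 (variance reduction): moving average along the time axis.
    half = win // 2
    kernel = np.ones(win) / win
    padded = np.pad(log_power, ((half, half), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid")
         for k in range(log_power.shape[1])],
        axis=1,
    )

    # Step 2 (mean shift): estimate the noise power per bin from the leading
    # frames, assumed here to be speech-free, and subtract it in the power domain.
    noise_power = np.exp(smoothed[:noise_frames]).mean(axis=0)
    clean_power = np.exp(smoothed) - noise_power

    # Floor the result so the logarithm stays defined after subtraction.
    clean_power = np.maximum(clean_power, floor_factor * noise_power)
    return np.log(clean_power)
```

On a synthetic spectrogram with a burst of high-power frames over stationary noise, the noise-only frames should end up well below their original level while the burst remains close to its original level.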
DRNTU::Engineering::Electrical and electronic engineering::Electronic systems::Signal processing