Towards high performance phonotactic feature for spoken language recognition
School of Computer Engineering
With the demands of globalization, multilingual speech is increasingly common in conversational telephone speech, broadcast news and internet podcasts. Automatic spoken language recognition has therefore become an important technology for multilingual speech applications; for example, it serves as a preprocessing component for spoken language translation, multilingual speech recognition and spoken document retrieval.

Both humans and machines rely on certain informative cues to differentiate one language from another. Inspired by findings on the discriminative cues used in human language recognition, most automatic language recognition systems rely on three kinds of features: acoustic, prosodic and phonotactic. Acoustic features capture spectral characteristics and can be obtained from short-term speech signals. Prosodic features such as tone, intonation, prominence and rhythm can be derived from energy measurements, pitch contours and their rates of change. Phonotactic features capture the statistics of lexical constraints and phonotactic patterns, and can be generated by a tokenization front end that converts speech signals into sequences of sound patterns.

This thesis focuses on effective phonotactic feature extraction methods for high-performance automatic language recognition. Specifically, the main contributions of this thesis are:

- A novel target-oriented method is proposed to construct parallel phone recognizers for robust phonotactic feature extraction. A subset of the most discriminative phones from an existing phone recognizer is selected to form a target-oriented phone tokenizer (TOPT). The TOPT tokenizers, one for each target language, are constructed from an existing phone recognizer without requiring additional transcribed training data.
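To illustrate the target-oriented selection idea, the sketch below ranks the phones of an existing recognizer by a log-likelihood-ratio discriminability score and keeps the top-k subset for one target language. The scoring criterion, the function name and the toy counts are assumptions for illustration only; the thesis's actual selection measure may differ.

```python
import math
from collections import Counter

def select_topt_phones(target_counts, background_counts, k):
    """Pick the k phones whose relative frequency best separates the
    target language from the background (all other languages pooled).
    Log-likelihood ratio with add-one smoothing is an assumed,
    illustrative discriminability criterion, not the thesis's exact one."""
    vocab = set(target_counts) | set(background_counts)
    t_total = sum(target_counts.values()) + len(vocab)
    b_total = sum(background_counts.values()) + len(vocab)

    def score(phone):
        # smoothed relative frequencies in target vs. background speech
        p_t = (target_counts.get(phone, 0) + 1) / t_total
        p_b = (background_counts.get(phone, 0) + 1) / b_total
        return math.log(p_t / p_b)

    return sorted(vocab, key=score, reverse=True)[:k]

# Toy phone-occurrence counts (hypothetical data)
target = Counter({"a": 50, "b": 30, "c": 1})
background = Counter({"a": 10, "b": 30, "c": 40})
print(select_topt_phones(target, background, 2))  # → ['a', 'b']
```

The selected subset then defines one TOPT; repeating the selection per target language yields the bank of parallel tokenizers, all derived from the same underlying phone recognizer.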
- A target-aware language model (TALM) method is proposed to generate phone tokenizers by constructing a set of phone language models, each dedicated to a target language. In the front-end decoding process with TALM, all the phone models of the original phone recognizer are retained, but they are constrained by the target-aware language models. Each target-aware language model emphasizes the phones that are discriminative for its specific target language.

- An automatic relevance feedback technique is proposed to incorporate more language information when recognizing short utterances. The idea is to augment a short input utterance with relevant utterances retrieved from a reference corpus. In this way, the short utterance carries richer information and better language recognition accuracy can be achieved.

- A feature selection method is proposed to remove redundant phonotactic information and make the language recognition system more efficient. Dimensionality reduction is achieved by measuring the importance of each feature with two criteria: its contribution to the SVM separation margin and its chi-squared statistic.
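Of the two selection criteria, the chi-squared statistic can be sketched as below, assuming a binary presence/absence count of each phonotactic n-gram feature per utterance. The counts are hypothetical, and the SVM-margin criterion is not shown.

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic of a 2x2 contingency table:
    n11 = feature present, target utterance
    n10 = feature present, non-target utterance
    n01 = feature absent,  target utterance
    n00 = feature absent,  non-target utterance"""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

# Hypothetical counts for one phonotactic n-gram feature:
# present in 40 of 50 target utterances, 5 of 50 non-target utterances
score = chi_squared(40, 5, 10, 45)
```

A higher score indicates a stronger association between the feature and the target-language label; features would be ranked by this score and only the top-scoring ones retained, shrinking the phonotactic feature vector.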
DRNTU::Engineering::Computer science and engineering::Computer systems organization::Performance of systems