Processing of speech utterances for computer aided training of speaking skills
Date of Issue: 2014
School of Electrical and Electronic Engineering
Institute for Media Innovation
Computer-aided language learning (CALL) involves the study and application of speech and language processing technologies to improve the process of language acquisition. Ideally, an effective CALL system should be able to accurately assess the performance of a language learner and generate meaningful feedback during the learning process. This thesis addresses several issues relevant to CALL systems, particularly for learning English as a second language (L2).
The first issue is the evaluation of prosody in the learner’s speech utterances. Prosody evaluation plays an important role in the automatic assessment of the English proficiency of L2 learners. It requires segmenting a speech utterance into appropriate units so that prosodic features can be modeled effectively. A segmentation scheme that takes prosodic units into account is proposed to improve prosody evaluation results. Unlike lexical units such as words or phonemes, prosodic units correspond to phrasing and rhythm information and are more appropriate for prosody evaluation. An algorithm is designed to segment the speech signal into prosodic units automatically, and it is shown that the algorithm can detect the proposed prosodic units with reasonable accuracy.
The second issue is the production of audio feedback, an important component of CALL. The learner’s vocal features and the teacher’s linguistic gestures are combined to produce effective feedback utterances that can facilitate the acquisition of English speaking skills. An accent reduction scheme that reduces the perceived accent in the learner’s utterances is studied. A multi-corpora experiment designed to examine the effects of external factors on accent reduction results resolves some ambiguities in the literature. In addition, different speech synthesis methods are described and implemented to perform accent reduction. Voice conversion is also applied as a new method to generate feedback utterances that possess the learner’s vocal features and the teacher’s linguistic gestures. The feedback utterances generated by the various accent reduction methods are compared with those produced by voice conversion in order to identify an optimal way to produce feedback utterances with high nativeness and acoustic quality. Based on this comparison, a multi-stage feedback scheme is proposed.
Finally, the phonetic segmentation process is studied and its performance is improved to produce more accurate phone boundary information. Such information can contribute to speech technologies that are applicable to the design of CALL systems. Three refinement methods are presented: statistical correction, multi-resolution fusion, and predictive-model-based refinement. These methods are combined to improve the accuracy of a baseline phonetic segmentation system that uses forced alignment. The proposed refinement scheme is also extended to a cross-corpora scenario, which enables the analysis of a new corpus with limited labeled data and thus facilitates the use of the new corpus for purposes such as speech recognition and linguistic research.
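The abstract does not spell out how statistical correction works. As a rough illustration only, one common reading is that a systematic bias in forced-alignment boundaries is estimated per phone-transition type from a small manually labeled set and then removed from new boundaries. The sketch below follows that assumption; the function names, the tuple layout, and the example times are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from statistics import mean


def learn_offsets(training):
    """Estimate the mean boundary error for each phone transition.

    training: iterable of (left_phone, right_phone, auto_time, manual_time),
    with times in seconds. This layout is an assumption for illustration.
    """
    errors = defaultdict(list)
    for left, right, auto_t, manual_t in training:
        errors[(left, right)].append(manual_t - auto_t)
    return {pair: mean(vals) for pair, vals in errors.items()}


def correct(boundaries, offsets):
    """Shift each automatic boundary by the learned offset for its transition.

    boundaries: list of (left_phone, right_phone, auto_time).
    Transitions never seen in training are left unchanged.
    """
    return [(l, r, t + offsets.get((l, r), 0.0)) for l, r, t in boundaries]


# Hypothetical labeled examples: forced-alignment time vs. manual time.
training = [
    ("s", "iy", 0.100, 0.110),
    ("s", "iy", 0.300, 0.314),
    ("iy", "t", 0.500, 0.495),
]
offsets = learn_offsets(training)
corrected = correct([("s", "iy", 0.200), ("k", "ae", 0.400)], offsets)
```

Leaving unseen transitions unchanged is a deliberately conservative choice for this sketch; a fuller system might back off to broader phone classes instead.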
DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications