Sentence unit detection using lexical information for automatic speech recognition transcripts
Ho, Thi Nga
Date of Issue: 2018-12-31
School of Computer Science and Engineering
This thesis studies Sentence Unit Detection (SUD) using lexical information for Automatic Speech Recognition (ASR) transcripts. An SUD system detects the presence or absence of a sentence unit boundary at every word boundary of an unpunctuated text document. A transcript produced by an ASR system from a given speech audio input is such an unpunctuated document. Hence, SUD for ASR transcripts increases their readability and enables their use in downstream text processing applications, such as machine translation and opinion mining, which require sentence unit boundary information in their input. As an SUD system is usually a post-processing application for ASR transcripts, it may benefit from two major information sources for model learning: the prosodic information obtained from the audio input and the lexical information obtained from the ASR transcripts. In this thesis, we chose to use only lexical information because it has been shown to be more essential for SUD than prosodic information. Moreover, using lexical information takes advantage of the massive lexical data sources available on the Internet. Additionally, SUD models that use only lexical information are usually much less complex than models that employ multiple information sources. To improve SUD system performance, one possible direction is to improve the quality of the input features. Previous studies usually paid more attention to finding new input features to enrich and enlarge the existing feature space than to refining it for better quality. This tendency may lead to feature redundancy and unnecessarily increase model complexity. This thesis addresses this research gap by studying approaches that fine-tune existing lexical features for SUD systems. Specifically, two approaches are proposed. The first focuses on optimizing the existing features to reduce the size of the input feature space.
The second focuses on evaluating the usefulness of variants of the same feature type to the performance of an SUD system. The first study proposes a feature selection method to optimize the distant-bigram features used as input to a CRF-based SUD system. The goal is to reduce the system's model complexity while maintaining comparable performance. The study uses Pointwise Mutual Information (PMI) to select only informative features for model training, thereby reducing training cost and model complexity. PMI measures the correlation between an input feature and the presence of a sentence unit boundary at the hypothesized word boundary. The obtained PMI values are then used as selection criteria: the higher the PMI value of a feature, the higher the chance that the feature is selected for model training. The proposed method reduces the size of the input feature space by a relative 44.87% while maintaining performance comparable to the original model without feature selection. CRF is a shallow learning technique that usually relies heavily on feature engineering to achieve good prediction performance, so it is reasonable to obtain better results, in terms of both model complexity and performance, by fine-tuning the input feature space. However, deep learning, which has recently gained popularity in SUD research, has been shown to be less dependent on feature engineering, raising the question of whether fine-tuning the existing input features is still helpful. To answer this, our second study proposes a framework to evaluate the effectiveness of different variants of the same feature type, i.e., word embeddings, as input for training a deep-learning-based SUD model.
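As a rough illustration, the PMI-based selection described above can be sketched as follows. The data layout (one feature list per word boundary, with a binary boundary label) and the counting scheme are illustrative assumptions, not the thesis's exact feature definitions:

```python
import math
from collections import Counter

def pmi_select(samples, labels, top_k):
    """Rank features by PMI with the boundary label and keep the top_k.

    samples: list of feature lists, one per hypothesized word boundary
             (e.g., distant-bigram features around that boundary)
    labels:  list of 0/1 flags; 1 = a sentence unit boundary is present
    """
    n = len(samples)
    feat_count = Counter()   # how often each feature occurs
    joint_count = Counter()  # how often it co-occurs with a boundary
    p_boundary = sum(labels) / n
    for feats, y in zip(samples, labels):
        for f in set(feats):
            feat_count[f] += 1
            if y == 1:
                joint_count[f] += 1
    scores = {}
    for f, c in feat_count.items():
        p_f = c / n
        p_fb = joint_count[f] / n
        if p_fb > 0:  # PMI undefined for features never seen at a boundary
            scores[f] = math.log(p_fb / (p_f * p_boundary))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```

Features that co-occur with boundaries more often than chance get a positive PMI and survive the cut; features independent of (or anti-correlated with) boundaries are pruned, shrinking the input space for the CRF.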
Specifically, two word embedding variants, i.e., word-based and subword-based skip-gram, are trained on either a small but in-domain dataset or a large but out-of-domain dataset, resulting in four word embedding models: in-domain word-based, out-of-domain word-based, in-domain subword-based, and out-of-domain subword-based. Each of the four models is then used to extract embedding features as input to a deep neural network with biLSTM and fully connected layers, training a separate SUD model for each. Our study reveals that the in-domain subword-based embedding gives the best performance under all testing conditions. When no in-domain dataset is available for word embedding training, the word-based embedding model is recommended instead.
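For intuition on the subword-based variant: fastText-style subword skip-gram models build a word's vector from its character n-grams, which lets them represent out-of-vocabulary words, common in ASR transcripts, by composing known n-grams. A minimal sketch of the n-gram extraction, where the n-gram range is an illustrative assumption rather than the thesis's setting:

```python
def subword_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with boundary markers '<' and '>'.

    In a subword-based skip-gram model, the word's embedding is derived
    from the vectors of these n-grams (e.g., by summing or averaging),
    so even an unseen word still receives a representation.
    """
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams
```

For example, `subword_ngrams("cat", 3, 3)` yields `["<ca", "cat", "at>"]`; an OOV word shares many such n-grams with in-vocabulary words, which is one plausible reason the subword-based variant performs well on ASR transcripts.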
DRNTU::Engineering::Computer science and engineering