Full metadata record
DC Field | Value | Language
dc.contributor.author | Xue, Boyang | en_US
dc.contributor.author | Hu, Shoukang | en_US
dc.contributor.author | Xu, Junhao | en_US
dc.contributor.author | Geng, Mengzhe | en_US
dc.contributor.author | Liu, Xunying | en_US
dc.contributor.author | Meng, Helen | en_US
dc.identifier.citation | Xue, B., Hu, S., Xu, J., Geng, M., Liu, X. & Meng, H. (2022). Bayesian neural network language modeling for speech recognition. IEEE/ACM Transactions On Audio Speech and Language Processing, 30, 2900-2917. |
dc.description.abstract | State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex. They are prone to overfitting and poor generalization when given limited training data. To this end, an overarching full Bayesian learning framework encompassing three methods is proposed in this paper to account for the underlying uncertainty in LSTM-RNN and Transformer LMs. The uncertainty over their model parameters, choice of neural activations and hidden output representations are modeled using Bayesian, Gaussian Process and variational LSTM-RNN or Transformer LMs respectively. Efficient inference approaches were used to automatically select the optimal network internal components to be Bayesian learned using neural architecture search. A minimal number of Monte Carlo parameter samples as low as one was also used. These allow the computational costs incurred in Bayesian NNLM training and evaluation to be minimized. Experiments are conducted on two tasks: AMI meeting transcription and Oxford-BBC LipReading Sentences 2 (LRS2) overlapped speech recognition using state-of-the-art LF-MMI trained factored TDNN systems featuring data augmentation, speaker adaptation and audio-visual multi-channel beamforming for overlapped speech. Consistent performance improvements over the baseline LSTM-RNN and Transformer LMs with point estimated model parameters and drop-out regularization were obtained across both tasks in terms of perplexity and word error rate (WER). In particular, on the LRS2 data, statistically significant WER reductions up to 1.3% and 1.2% absolute (12.1% and 11.3% relative) were obtained over the baseline LSTM-RNN and Transformer LMs respectively after model combination between Bayesian NNLMs and their respective baselines. | en_US
dc.relation.ispartof | IEEE/ACM Transactions on Audio Speech and Language Processing | en_US
dc.rights | © 2022 IEEE. All rights reserved. | en_US
dc.subject | Engineering::Computer science and engineering | en_US
dc.title | Bayesian neural network language modeling for speech recognition | en_US
dc.type | Journal Article | en
dc.contributor.school | School of Computer Science and Engineering | en_US
dc.subject.keywords | Bayesian Learning | en_US
dc.subject.keywords | Model Uncertainty | en_US
dc.description.acknowledgement | This work was supported in part by Hong Kong Research Council GRF under Grants 14200218, 14200220, and 14200021 and in part by Innovation and Technology Fund under Grants ITS/254/19 and InP/057/21. | en_US
item.fulltext | No Fulltext | -
Appears in Collections: SCSE Journal Articles






Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.