Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/168498
Title: Enhancing spoken language identification and diarization for multilingual speech
Authors: Liu, Hexin
Keywords: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Electrical and electronic engineering
Issue Date: 2023
Publisher: Nanyang Technological University
Source: Liu, H. (2023). Enhancing spoken language identification and diarization for multilingual speech. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/168498
Abstract: Spoken language identification (LID) refers to the automatic process of determining the identity of the language spoken in a speech signal. It is widely employed as a preprocessing step in multilingual speech processing systems. While existing approaches achieve high performance for general LID, performing well on speech of varying duration remains challenging. In addition, general LID methods employ only a single type of language cue. Since language cues describe language information from different perspectives, incorporating multiple cues is expected to yield higher performance than using a single cue. Therefore, in this thesis, an x-vector self-attention LID (XSA-LID) model is proposed to achieve robustness to speech duration. Two approaches are then introduced to improve and incorporate language cues, respectively. Finally, LID is performed in a more complex scenario, language diarization, via an end-to-end LD model.
To achieve robustness against performance degradation due to varying duration, a dual-mode framework on the XSA-LID model with knowledge distillation (KD) is proposed. The dual-mode XSA-LID model is trained by jointly optimizing the full and short modes, whose respective inputs are the full-length speech and a short clip extracted by a specific Boolean mask; KD is then applied to further boost performance on short utterances. In addition, the impact of clip-wise linguistic variability and lexical integrity on LID is investigated by analyzing how LID performance varies with the lengths and positions of the mimicked speech clips.
To enhance LID from the perspective of language cues, two methods are introduced through which language cues can be utilized efficiently and effectively. The first investigates efficient ways to compute reliable representations and discard redundant information for LID using a pre-trained multilingual wav2vec 2.0 model.
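The dual-mode idea above can be illustrated with a minimal sketch: a Boolean mask carves a short clip out of the full-length input, and a KD term pulls the short-mode prediction toward the full-mode prediction. This is an assumption-laden toy (function names, the temperature, and the KL direction are illustrative, not the thesis implementation):

```python
import math
import random

def softmax(logits, temp=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / temp) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def boolean_mask_clip(frames, clip_len, start=None):
    """Keep only frames where a Boolean mask is True, i.e. a short clip
    of length clip_len taken from the full-length frame sequence."""
    if start is None:
        start = random.randrange(0, max(1, len(frames) - clip_len + 1))
    mask = [start <= i < start + clip_len for i in range(len(frames))]
    return [f for f, m in zip(frames, mask) if m]

def kd_loss(student_logits, teacher_logits, temp=2.0):
    """KL divergence from the full-mode (teacher) distribution to the
    short-mode (student) distribution, both softened by a temperature."""
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In a joint training loop, the classification losses of both modes and this KD term would be summed; here only the mask and the distillation objective are sketched.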
To determine an optimal basic system, the performance of wav2vec features extracted from different inner layers of the context network is compared. For this approach, the XSA-LID model forms the backbone used to discriminate between languages. Two mechanisms are then employed to reduce information irrelevant to LID in the representations: the attentive squeeze-and-excitation (SE) block, which performs dimension-wise scaling, and the linear bottleneck (LBN) block, which removes irrelevant information through nonlinear dimension reduction. Incorporating these mechanisms into the XSA-LID model yields the AttSE-XSA and LBN-XSA models, respectively.
In the second approach, a novel LID model, named PHO-LID, is proposed to hierarchically incorporate phoneme and phonotactic information without requiring phoneme annotations for training. In this model, a self-supervised phoneme segmentation task and an LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of “phonotactic” embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. This architecture is called CNN-Trans.
Finally, LID is extended to language diarization in a code-switching scenario. In this work, two end-to-end neural configurations are proposed for language diarization on bilingual code-switching speech. The first, a BLSTM-E2E architecture, uses a set of stacked bidirectional LSTMs to compute embeddings and incorporates the deep clustering loss to enforce grouping of segments belonging to the same language. The second, an XSA-E2E architecture, is based on an x-vector model followed by a self-attention encoder: the former encodes frame-level features into segment-level embeddings, while the latter considers all those embeddings to generate a sequence of segment-level language labels.
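The deep clustering loss mentioned for the BLSTM-E2E configuration has a standard form: it compares the pairwise affinity matrix of the (segment) embeddings V with that of the one-hot language labels Y, as ||VV^T - YY^T||_F^2. A minimal pure-Python sketch of that objective, assuming row vectors for embeddings and labels (not the thesis code):

```python
def self_affinity(A):
    """Compute A @ A.T for a list-of-rows matrix A: entry (i, j) is the
    inner product of rows i and j, i.e. the pairwise affinity."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in A] for r1 in A]

def deep_clustering_loss(V, Y):
    """Squared Frobenius norm of (VV^T - YY^T): small when embeddings of
    segments sharing a language label are similar and others dissimilar."""
    AV, AY = self_affinity(V), self_affinity(Y)
    return sum((av - ay) ** 2
               for row_v, row_y in zip(AV, AY)
               for av, ay in zip(row_v, row_y))
```

If two segments carry different language labels but identical embeddings, the label affinity is 0 while the embedding affinity is 1, so the loss penalizes the pair; this is exactly the grouping pressure the abstract describes.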
All proposed approaches are evaluated on standard datasets, including NIST LRE 2017, OLR, SEAME, and WSTCSMC 2020. Compared with baseline systems, they exhibit significant performance improvements on their respective language identification and diarization tasks.
URI: https://hdl.handle.net/10356/168498
DOI: 10.32657/10356/168498
Schools: School of Electrical and Electronic Engineering 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:EEE Theses

Files in This Item:
Hexin_Thesis.pdf  4.48 MB  Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.