Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/164956
Title: Non-reference speech quality assessment based on deep learning
Authors: Fang, Xuhui
Keywords: Engineering::Electrical and electronic engineering
Issue Date: 2023
Publisher: Nanyang Technological University
Source: Fang, X. (2023). Non-reference speech quality assessment based on deep learning. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/164956
Abstract: Voice quality evaluation is an important technique in speech processing and is widely used in mobile communications, the Internet, public safety, digital entertainment, consumer electronics, and other fields. Early work relied solely on subjective voice quality assessment, which requires substantial human resources, annotated data, and time. Objective voice quality evaluation methods therefore gradually became popular. Reference-based speech quality assessment models require the clean original speech signal, which is sometimes difficult to obtain in practice. As a result, non-reference speech quality assessment has received increased attention, especially in recent years. Many researchers have integrated deep learning into non-reference speech quality assessment, which has led to a major breakthrough in this field. However, existing deep learning-based speech quality evaluation still has limitations, such as insufficient accuracy and a large number of parameters. To address these limitations, this dissertation studies non-reference speech quality evaluation based on deep learning. The main research is summarized below.
(1) To address the insufficient accuracy of existing voice quality assessment, this dissertation proposes improvements from multiple perspectives. BiLSTM (Bidirectional Long Short-Term Memory) is used to improve the time-dependent model, exploiting its ability to learn speech context information effectively. On this basis, a Squeeze-and-Excitation (SE) module is added to learn the correlation between different channels in the feature map and to weight the channels accordingly, thereby recalibrating the feature map. In addition, a custom loss function based on the signal loss ratio is used to improve model fitting, which further improves the evaluation performance of the model. Experimental results show the effectiveness of this method.
(2) To address the large number of parameters in existing speech quality evaluation models, we propose a low-complexity speech quality evaluation method based on depthwise residual convolution and a Bidirectional Gated Recurrent Unit (BiGRU), the SE-DSResBGRU-NRSQA model. The main goal of this model is to reduce the number of parameters by using BiGRU and depthwise separable convolution, structuring the convolutional part after the residual network (ResNet), and using shallow feature information to improve the evaluation performance through direct mapping. On this basis, SE modules are added to learn the importance of different channels, so as to exploit the input information effectively and improve the evaluation performance of the system. The experimental results show that the proposed method achieves good speech quality evaluation with a relatively small number of parameters.
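The parameter savings that part (2) attributes to depthwise separable convolution and BiGRU can be illustrated with a short count. The sketch below is not taken from the thesis: the function names and the layer sizes (3x3 kernel, 64 to 128 channels, 128 input features, 64 hidden units) are hypothetical, biases are omitted for the convolutions, and the recurrent count uses the textbook formulation with one bias vector per weight set, so framework-reported totals may differ slightly.

```python
def conv2d_params(kernel, c_in, c_out):
    """Weights in a standard 2-D convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (bias omitted)."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1x1 conv that mixes channels
    return depthwise + pointwise


# Number of weight sets per cell: an LSTM has three gates plus a candidate
# state, a GRU has two gates plus a candidate state.
LSTM_GATES = 4
GRU_GATES = 3


def recurrent_params(gates, input_size, hidden_size, bidirectional=True):
    """Parameters of one recurrent layer, one bias vector per weight set."""
    per_set = hidden_size * (input_size + hidden_size) + hidden_size
    directions = 2 if bidirectional else 1
    return directions * gates * per_set


if __name__ == "__main__":
    # Hypothetical layer sizes chosen only for illustration.
    print("standard conv:      ", conv2d_params(3, 64, 128))
    print("depthwise separable:", depthwise_separable_params(3, 64, 128))
    print("BiLSTM layer:       ", recurrent_params(LSTM_GATES, 128, 64))
    print("BiGRU layer:        ", recurrent_params(GRU_GATES, 128, 64))
```

Under these assumed sizes the depthwise separable layer needs roughly one eighth of the weights of the standard convolution, and the BiGRU layer needs three quarters of the parameters of the BiLSTM layer.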
URI: https://hdl.handle.net/10356/164956
Schools: School of Electrical and Electronic Engineering 
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: EEE Theses

Files in This Item:
File: FANG XUHUI-dissertation.pdf
Description: Restricted Access
Size: 2.01 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.