Speaker diarization in meetings domain
Nguyen, Trung Hieu
Date of Issue: 2015
School of Computer Engineering
Emerging Research Lab
The purpose of this study is to develop robust techniques for speaker segmentation and clustering, with a focus on the meetings domain. The techniques examined can, however, also be applied to other domains such as telephone conversations and broadcast news. Traditional speaker diarization techniques developed for telephone conversations or broadcast news assume a single channel, which differs notably from the meetings domain, where multiple channels may be available. When adapted to meetings, these techniques perform worse than expected because they do not exploit direction-of-arrival information, which is available in many meeting rooms equipped with multiple microphones. Moreover, many of these techniques involve tunable parameters, which are presumably derived using external data. These parameters must be adjusted individually for each data set to obtain reasonable performance. In this thesis, the focus is on robust and accurate speaker diarization techniques for meetings. Our aim is to improve segmentation and clustering performance in diverse conditions while keeping the number of manually tuned parameters to a minimum. Starting from the widely adopted agglomerative hierarchical clustering framework, a comparative study of various distance metrics is conducted to find the most appropriate metric for speaker clustering. Contrary to general practice, it is shown that the popular metrics based on the Generalized Likelihood Ratio (GLR), such as GLR itself and the Bayesian Information Criterion (BIC), should not be used as distance metrics since they are not robust to variations in segment size. As a result, a novel metric is proposed, which can be seen as an extension of the Information Change Rate (ICR) that exploits the second-order statistics of the likelihood scores.
The proposed metric is shown to be much less affected by the length of speech segments, and results on diarization tasks show a relative improvement in diarization error rate (DER) of more than 10% compared to GLR. Having addressed cluster merging by investigating various distance metrics, this work then suggests robust techniques for determining the number of clusters. In the suggested methods, two novel metrics are presented to measure partitioning quality in terms of the separation between two distributions: one for the distances between segments of the same speaker and one for the distances between segments of different speakers. These techniques have been evaluated on the meeting data sets of the RT07s NIST Rich Transcription evaluation and achieve competitive performance without the need to learn a threshold for estimating the number of clusters, as conventional state-of-the-art systems require. Finally, a multi-stream speaker clustering approach is studied, with emphasis on assessing the relative significance of each individual stream; as a result, an adaptive weighting scheme for the feature streams is suggested. This adaptive weighting scheme is shown to perform better than a fixed weighting scheme, with the additional benefit that no training data is required to determine the weights. The complete systems, one for a single channel and one for multiple channels, were submitted to the RT09s NIST Rich Transcription evaluation and achieved first rank in the speaker diarization category.
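The sensitivity of GLR-based metrics to segment size, noted in the abstract, can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes each segment is modeled by a single Gaussian over one-dimensional features (real systems typically use multi-dimensional cepstral features and richer models), and the function names `glr_distance`, `delta_bic`, and `draw_segment`, as well as the synthetic data, are hypothetical choices made for illustration only.

```python
import math
import random
from statistics import pvariance


def glr_distance(x, y):
    """GLR between two segments, each modeled by one 1-D Gaussian.

    Uses maximum-likelihood (population) variances, so the value is
    0.5 * (N log var_xy - N1 log var_x - N2 log var_y).
    """
    n1, n2 = len(x), len(y)
    merged = x + y
    return 0.5 * (
        (n1 + n2) * math.log(pvariance(merged))
        - n1 * math.log(pvariance(x))
        - n2 * math.log(pvariance(y))
    )


def delta_bic(x, y, penalty_weight=1.0):
    """GLR minus a BIC-style complexity penalty.

    Splitting one Gaussian into two adds one mean and one variance,
    i.e. 2 extra parameters in the 1-D case assumed here.
    """
    n = len(x) + len(y)
    extra_params = 2
    return glr_distance(x, y) - penalty_weight * 0.5 * extra_params * math.log(n)


def draw_segment(rng, mean, length):
    """Synthetic 'speaker' segment: unit-variance Gaussian samples."""
    return [rng.gauss(mean, 1.0) for _ in range(length)]


rng = random.Random(0)

# The same pair of underlying distributions, observed at two segment lengths.
short_a, short_b = draw_segment(rng, 0.0, 100), draw_segment(rng, 1.0, 100)
long_a, long_b = draw_segment(rng, 0.0, 1000), draw_segment(rng, 1.0, 1000)

glr_short = glr_distance(short_a, short_b)
glr_long = glr_distance(long_a, long_b)
bic_short = delta_bic(short_a, short_b)
bic_long = delta_bic(long_a, long_b)

print(f"GLR   short={glr_short:.1f}  long={glr_long:.1f}")
print(f"dBIC  short={bic_short:.1f}  long={bic_long:.1f}")
```

In this sketch the two "speakers" are equally far apart in both cases, yet the GLR value grows roughly in proportion to the number of frames, so distances computed on segments of different lengths are not directly comparable. The logarithmic BIC penalty does not remove this growth.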
DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications