Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/181493
Title: Adapting whisper for phoneme recognition on stroke-impaired speech
Authors: Ong, Hai Xiang
Keywords: Computer and Information Science
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Ong, H. X. (2024). Adapting whisper for phoneme recognition on stroke-impaired speech. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181493
Abstract: Phoneme recognition (PR) for impaired speech, such as speech affected by stroke, presents unique challenges due to phonetic vulnerability and articulation issues. This study investigates the adaptation of Whisper, a large-scale sequence-to-sequence audio model, for PR in this domain. It applies the Self-Supervised Contrastive Recalibration for Robust Encoding (SCORE) methodology, leveraging Whisper’s encoder for robust latent representation alignment. The results are compared against two state-of-the-art (SOTA) self-supervised learning models, WavLM and HuBERT, which have demonstrated strong performance in clean and noisy speech tasks. Despite the significant relative improvements in phoneme error rate (PER) that prior work reports for WavLM and HuBERT with SCORE, we found that Whisper consistently outperformed these models in PR for impaired speech. Whisper achieved a PER of 26.49%, surpassing the adjusted performance of WavLM and HuBERT even when accounting for hypothetical SCORE-induced gains. These findings suggest that Whisper’s architecture and extensive training on diverse data give it superior adaptability in handling speech variability and dysfluencies, highlighting its potential in clinical applications such as speech therapy and rehabilitation. This study further explores the impact of layer-freezing strategies on model performance, revealing that unfreezing the top 8 layers in Whisper yields the best PR results. While this exploration of layer-freezing strategies is by no means exhaustive, these insights underscore the importance of architectural suitability, training diversity and task-specific fine-tuning techniques in advancing PR for impaired speech.
URI: https://hdl.handle.net/10356/181493
Schools: College of Computing and Data Science 
Organisations: Imperial College London
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Student Reports (FYP/IA/PA/PI)
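
Note on the metric: the phoneme error rate (PER) quoted in the abstract is conventionally the edit distance between the reference and predicted phoneme sequences (substitutions, deletions and insertions), divided by the reference length. The minimal sketch below assumes that standard definition; this record does not describe the report's own scoring procedure, and the function names and phoneme symbols are illustrative only.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one deleted phoneme out of four reference phonemes.
reference = ["s", "p", "iy", "ch"]
hypothesis = ["s", "iy", "ch"]
per = phoneme_error_rate(reference, hypothesis)
```

Under this definition a PER can exceed 100% when the hypothesis contains many insertions, which is why it is described as an error rate and not as an accuracy complement.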

Files in This Item:
OFYP.pdf (Restricted Access), 1.05 MB, Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.