Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/181493
Title: Adapting whisper for phoneme recognition on stroke-impaired speech
Authors: Ong, Hai Xiang
Keywords: Computer and Information Science
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Ong, H. X. (2024). Adapting whisper for phoneme recognition on stroke-impaired speech. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181493
Abstract: Phoneme recognition (PR) for impaired speech, such as speech affected by stroke, presents unique challenges due to phonetic vulnerability and articulation issues. This study investigates adapting Whisper, a large-scale sequence-to-sequence audio model, to PR tasks in this domain. It applies the Self-Supervised Contrastive Recalibration for Robust Encoding (SCORE) methodology, leveraging the Whisper encoder for robust latent representation alignment. The results are compared against two state-of-the-art (SOTA) self-supervised learning models, WavLM and HuBERT, which have demonstrated strong performance on clean and noisy speech tasks. Although prior work reports significant relative improvements in phoneme error rate (PER) for WavLM and HuBERT when SCORE is applied, we found that Whisper consistently outperformed both models in PR for impaired speech. Whisper achieved a PER of 26.49%, surpassing the adjusted performance of WavLM and HuBERT even after accounting for hypothetical SCORE-induced gains. These findings suggest that Whisper’s architecture and extensive training on diverse data give it superior adaptability to speech variability and dysfluencies, highlighting its potential in clinical applications such as speech therapy and rehabilitation. This study also explores the impact of layer-freezing strategies on model performance and finds that unfreezing the top 8 layers of Whisper yields the best PR results. While this exploration of layer-freezing strategies is by no means exhaustive, these insights underscore the importance of architectural suitability, training diversity and task-specific fine-tuning techniques in advancing PR for impaired speech.
URI: https://hdl.handle.net/10356/181493
Schools: College of Computing and Data Science
Organisations: Imperial College London
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Student Reports (FYP/IA/PA/PI)
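The abstract reports its headline result as a phoneme error rate (PER). This record does not describe the scoring procedure used in the report, so the following is only a minimal sketch of the conventional definition: the total edit distance (substitutions, deletions and insertions) between reference and predicted phoneme sequences, divided by the total number of reference phonemes. The function names and the toy phoneme sequences are illustrative assumptions, not taken from the report.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: Sequence[Sequence[str]],
                       hyps: Sequence[Sequence[str]]) -> float:
    """Corpus-level PER: total edit errors / total reference phonemes."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("references must contain at least one phoneme")
    return total_errors / total_ref


# Toy example (hypothetical data): one substitution and one deletion
# over six reference phonemes.
refs = [["k", "ae", "t"], ["d", "ao", "g"]]
hyps = [["k", "ae", "p"], ["d", "ao"]]
per = phoneme_error_rate(refs, hyps)
print(f"PER = {per:.2%}")
```

Under this definition, a PER of 26.49% would correspond to roughly 26 edit errors per 100 reference phonemes. Whether the report aggregates errors over the whole test set, as here, or averages per utterance is not stated in this record.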
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.