Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/61577
Title: Online text mining
Authors: Tan, Abel Peng Heng
Keywords: DRNTU::Engineering::Computer science and engineering::Computer applications::Administrative data processing
Issue Date: 2014
Abstract: Training a language model on conversational speech is difficult because of the large variation in how conversational speech is produced, and deriving conversational text through direct transcription is costly and impractical for a large corpus. A solution is to use text that is readily available on the Internet to improve an existing language model built from broadcast news. In this project, the author developed an automated application that mines text from online sources, transforms the data into human-speakable form through normalization techniques, and then uses it to build a language model for adaptation of the existing model. The system mined 16 years of data, amounting to 1.69 GB of news article text. Through smoothing techniques and interpolation weight analysis, the author significantly improved the perplexity of the existing broadcast-news system, by 48.5%. Previous findings had shown that a larger corpus can improve the perplexity of a language model; however, there is a constant need for better ways to use data rather than massively crawling the Internet. Thus, the second objective of this project was to investigate the effectiveness of improving a language model with the latest data: to find out whether constantly crawling for new data improves a language model, or whether the change in perplexity is so small as to be negligible. The author conducted eight experiments, using 10 years of past data to establish a baseline perplexity. In each experiment, new data was added to the base model before the perplexity test was carried out again. The findings showed that although the new data averaged only 0.18% of the baseline's training data, it yielded an average perplexity improvement of 1.9%. The author therefore concludes that new data is highly important and should always be crawled to improve a language model whose usage patterns change with new data. The final experiment created language models with a range of vocabulary sizes; tests on them revealed that an increase in vocabulary size actually increases perplexity.
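The normalization step the abstract describes typically rewrites non-speakable tokens (digits, symbols, punctuation) as spoken words before language-model training. The Python sketch below illustrates one minimal way this can be done; the digit-expansion rule and function names are illustrative assumptions, not the pipeline the report actually used.

import re

# Hypothetical sketch of "speakable" text normalization for LM training.
# The rules below (digit-by-digit expansion, punctuation stripping) are
# assumptions for illustration, not the author's actual pipeline.

_SMALL = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def spell_digits(match: re.Match) -> str:
    """Expand a digit string to spoken words, digit by digit."""
    return " ".join(_SMALL[int(d)] for d in match.group())

def normalize(line: str) -> str:
    """Lowercase, expand digits, and strip punctuation for LM training."""
    line = line.lower()
    line = re.sub(r"\d+", spell_digits, line)   # "2014" -> "two zero one four"
    line = re.sub(r"[^a-z' ]+", " ", line)      # drop punctuation and symbols
    return " ".join(line.split())               # collapse repeated whitespace

print(normalize("GDP grew 4.1% in 2014, analysts said."))
# -> "gdp grew four one in two zero one four analysts said"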
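The 48.5% perplexity reduction was obtained through smoothing and interpolation weight analysis. Assuming the standard formulation (the abstract does not spell out the exact setup), the adapted model linearly interpolates the web-text model with the broadcast-news baseline, and perplexity is measured on held-out text:

\[
P_{\mathrm{interp}}(w \mid h) = \lambda\, P_{\mathrm{web}}(w \mid h)
  + (1 - \lambda)\, P_{\mathrm{news}}(w \mid h), \qquad 0 \le \lambda \le 1
\]
\[
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N}
  \log P_{\mathrm{interp}}(w_i \mid h_i) \right)
\]

A lower PPL means the model assigns higher probability to the test text; the weight \(\lambda\) is usually tuned on held-out data to minimize PPL.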
URI: http://hdl.handle.net/10356/61577
Schools: School of Computer Engineering 
Research Centres: Emerging Research Lab 
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections:SCSE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: Amended Final Year Report.pdf (Restricted Access)
Size: 1.51 MB
Format: Adobe PDF

Page view(s): 554 (updated on Mar 14, 2025)
Download(s): 41 (updated on Mar 14, 2025)

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.