Title: Web crawler for newspaper text
Authors: Phuah, Chee Chong
Keywords: DRNTU::Engineering::Computer science and engineering
Issue Date: 2015
Abstract: A huge collection of news-related data is available electronically today because of the World Wide Web. Web crawling provides a way to obtain these data and to train language models that can be improved as more data are collected. However, every website is developed differently, and extracting specific parts of each website can require major rework of the web crawler. Many existing web crawlers do not support crawling multiple websites, nor do they allow specific parts of a web page to be selected. The primary objective of this project is to develop a web crawler that can crawl multiple news websites and needs only minimal modification when more websites are added. This is achieved by realising four software quality attributes: reusability, modularity, portability and scalability. The web crawler is developed in Python with external libraries that improve the efficiency and performance of its crawling process. It is capable of crawling multiple news websites in multiple languages (e.g. English, Malay and Vietnamese) with selection policies unique to each website. The selection policies are used to identify the specific links from which data are to be extracted and to select the content to extract. The extracted data are stored in XML files with custom tags so that, regardless of how differently each website is developed, the extracted content is in a standardised format. In addition to the web crawler, a Text Normalisation module was developed separately in this project to examine the quality of the extracted data. The Text Normalisation module normalises text into a format that a language modelling toolkit can use to train a language model based on the normalised text. The same toolkit is used to test data against the trained language model, which produces a perplexity value.
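As a rough illustration of the per-site selection policies and the standardised XML output described above, a minimal sketch follows. It is not the project's implementation: the `SelectionPolicy` class, its field names, the XML tag names, the URL pattern and the sample HTML are all hypothetical, and only the Python standard library is used because the abstract does not name the external libraries.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class SelectionPolicy:
    """Hypothetical per-site policy: which links to keep and where the text lives."""
    site: str
    language: str
    link_pattern: str   # regex that an article URL must match
    content_tag: str    # tag of the element holding the article body
    content_class: str  # class attribute value of that element


class ContentExtractor(HTMLParser):
    """Collects every href and the text inside the element the policy selects."""

    def __init__(self, policy):
        super().__init__()
        self.policy = policy
        self.links = []
        self.text_parts = []
        self._depth = 0  # greater than 0 while inside the selected element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if self._depth:
            # Track nested elements with the same tag so the end is found correctly.
            if tag == self.policy.content_tag:
                self._depth += 1
        elif (tag == self.policy.content_tag
              and self.policy.content_class in (attrs.get("class") or "").split()):
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag == self.policy.content_tag:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.text_parts.append(data.strip())


def extract(page_html, policy):
    """Return (links matching the policy, selected text) for one page."""
    parser = ContentExtractor(policy)
    parser.feed(page_html)
    parser.close()
    links = [u for u in parser.links if re.search(policy.link_pattern, u)]
    return links, " ".join(parser.text_parts)


def to_xml(url, policy, text):
    """Wrap extracted text in the same (assumed) custom tags for every site."""
    article = ET.Element("article", {"site": policy.site, "lang": policy.language})
    ET.SubElement(article, "url").text = url
    ET.SubElement(article, "content").text = text
    return ET.tostring(article, encoding="unicode")


# Hypothetical policy and page, for illustration only.
policy = SelectionPolicy("example-news", "en", r"/news/\d{4}/", "div", "story")
sample_html = (
    '<a href="/news/2015/a1">A</a><a href="/about">About</a>'
    '<div class="story"><p>Hello</p> <p>world.</p></div>'
    '<div class="ads">Buy</div>'
)
links, text = extract(sample_html, policy)
xml_doc = to_xml(links[0], policy, text)
print(xml_doc)
```

Under this assumed design, supporting another website would mean adding another `SelectionPolicy` instance rather than changing the extraction code.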
The perplexity measured on each test set showed a similar pattern: as the date of the language model moves closer to the date of the test data, the perplexity gradually decreases. The overall perplexity is also lower whenever more data are used to train the language model. The results highlight the need for relevant, up-to-date data from news websites to train a news-type language model.
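For reference, perplexity over N test tokens is conventionally exp(-(1/N) Σ log p(w_i)), so a lower value means the model found the test text less surprising. The sketch below illustrates the measure with an add-one-smoothed unigram model. The toolkit, model order and smoothing used in the project are not stated in the abstract, so the model here is an assumption, and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count word frequencies; stands in for a real language modelling toolkit."""
    return Counter(tokens)


def perplexity(counts, test_tokens, vocab_size):
    """exp of the negative mean log-probability, using add-one smoothing."""
    total = sum(counts.values())
    log_prob = 0.0
    for word in test_tokens:
        p = (counts[word] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))


# Hypothetical toy corpora: one unrelated to the test text, one close to it.
older = "the storm hit the coast".split()
recent = "the election results were announced".split()
test = "the election results".split()

vocab_size = len(set(older) | set(recent) | set(test))

pp_old = perplexity(train_unigram(older), test, vocab_size)
pp_recent = perplexity(train_unigram(recent), test, vocab_size)
print(pp_old, pp_recent)
```

In this toy case the model trained on the text closer to the test data yields the lower perplexity, which is the direction of the pattern reported above; it says nothing about the magnitudes found in the project.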
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File    Description          Size       Format
        Restricted Access    2.39 MB    Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.