Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/62888
Title: | Developing web crawler and categorization of newspaper text |
Authors: | Singh, Rakhi |
Keywords: | DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval; DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing |
Issue Date: | 2015 |
Abstract: | The automated categorization (or classification) of texts into predefined categories has attracted booming interest over the last ten years, owing to the increased availability of documents in digital form on the World Wide Web, such as online newspapers, magazines, catalogues, blogs, and video transcripts. Existing supervised machine-learning text classification models face the challenge of needing a large corpus of labelled data to train the language models. An innovative approach to this problem is to use the already categorised news articles that are readily available on the internet. Within the scope of this project, a modular English text crawler is developed that can be extended to multiple languages and is capable of automatically crawling online newspaper archives and extracting new keywords and categories. The corpus is further smoothed and transformed into human-speakable forms using appropriate language-specific normalisation techniques. The crawler has mined over 1.16 GB of data covering 2006 to 2012. This normalised corpus is used to build bi-gram probability based statistical language models for each category. These single-label classifiers are then combined to form a text classification model. A document can be assigned to multiple categories with a certain degree of ranking, but in this project the primary focus is on assigning the most probable category to each news article based on the lowest perplexity value (highest similarity). The classification model built is more robust than most of its currently available counterparts. The system shows a high average accuracy rate of 99.37% and an average precision of 98.75% when perplexity tests were conducted with randomly chosen articles. |
URI: | http://hdl.handle.net/10356/62888 |
Schools: | School of Computer Engineering |
Research Centres: | Emerging Research Lab |
Rights: | Nanyang Technological University |
Fulltext Permission: | restricted |
Fulltext Availability: | With Fulltext |
Appears in Collections: | SCSE Student Reports (FYP/IA/PA/PI) |
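The abstract describes one bi-gram language model per category, with each article assigned to the category whose model yields the lowest perplexity. The following is a minimal sketch of that idea, not the report's implementation. It assumes whitespace tokenization, add-one (Laplace) smoothing, and a vocabulary shared across categories, none of which are specified in this record; the names `tokenize`, `BigramModel`, `train_models`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed preprocessing: lowercase, split on whitespace, add sentence markers.
    return ["<s>"] + text.lower().split() + ["</s>"]


class BigramModel:
    """Bi-gram model for one category, with add-one smoothing (an assumption)."""

    def __init__(self, documents, vocab_size):
        self.vocab_size = vocab_size
        self.context_counts = Counter()  # how often each word starts a bi-gram
        self.bigram_counts = Counter()   # how often each (previous, word) pair occurs
        for doc in documents:
            tokens = tokenize(doc)
            pairs = list(zip(tokens, tokens[1:]))
            self.bigram_counts.update(pairs)
            self.context_counts.update(prev for prev, _ in pairs)

    def perplexity(self, text):
        # Perplexity = exp of the negative mean log-probability per bi-gram.
        pairs = list(zip(tokenize(text), tokenize(text)[1:]))
        log_prob = 0.0
        for prev, word in pairs:
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / len(pairs))


def train_models(corpus):
    # corpus maps a category name to a list of training texts.
    vocab = {tok for docs in corpus.values() for doc in docs for tok in tokenize(doc)}
    vocab_size = len(vocab) + 1  # one extra slot for words never seen in training
    return {cat: BigramModel(docs, vocab_size) for cat, docs in corpus.items()}


def classify(models, text):
    # Pick the category whose model is least "surprised" by the text.
    return min(models, key=lambda cat: models[cat].perplexity(text))
```

With a dictionary of per-category training texts, `train_models(corpus)` builds one model per category and `classify(models, article)` returns the single most probable label. Ranking all categories by perplexity instead of taking the minimum would give the multi-category assignment the abstract mentions.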
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
RakhiSingh-FYP Report.pdf Restricted Access | Main Article | 1.76 MB | Adobe PDF | View/Open |
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.