Title: Machine learning based web page classifier
Authors: Setiawan, Andri
Keywords: DRNTU::Engineering
Issue Date: 2016
Abstract: In recent years, usage of the Internet has increased tremendously, and the total number of web pages has become enormous. The Internet is accessed for a wide variety of purposes and is growing very rapidly. In 2015, worldwidewebsize estimated that there were around 50 billion web pages on the Internet [1]. Web directories, such as DMOZ and Hotfrog, have classified web pages into a set of categories. This is done to assist Internet users and search engines such as Google. Search engines are known to use web directories to find and rank web pages for certain keywords. The largest web directory, DMOZ, is a human-edited directory and has listed around 4 million web pages [2]. Most web directories hire web experts to classify web pages into different categories, and this approach is not effective given the rate at which the Internet is growing. Hence, to improve effectiveness and automate web categorization, methods from machine learning and data mining have been researched to categorize web pages automatically. In this project, the features used for the classifier are all related to the HTML structure of the web pages. The most common HTML tags, metadata, and images are extracted from the HTML document. The classifiers used are a Neural Network for Pattern Recognition and a Support Vector Machine. Four classes of web pages are chosen for this project: Online Store, Internet Forum, News Article, and Blog Article. The web pages are collected manually through the Google search engine. The final application of this project is intended to classify a web page given its URL as input.
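As an illustration of the HTML-structure features mentioned in the abstract, the sketch below counts a few common tags and named meta elements using Python's standard html.parser module. The tag list, the class name StructureCounter, and the function name extract_features are assumptions made for illustration and are not taken from the report, whose actual feature set is not given here. The resulting vector is the kind of input that could be passed to a Support Vector Machine or neural network classifier.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag list; the report's actual feature set is not stated in the abstract.
TAGS = ("a", "p", "div", "img", "form", "input", "li", "h1", "h2")


class StructureCounter(HTMLParser):
    """Counts start tags and collects the names of <meta name="..."> elements."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.meta_names = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        self.tag_counts[tag] += 1
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.meta_names.add(name.lower())


def extract_features(html):
    """Return a fixed-length vector: one count per tag in TAGS,
    followed by the number of distinct named meta elements."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    features = [parser.tag_counts[t] for t in TAGS]
    features.append(len(parser.meta_names))
    return features


if __name__ == "__main__":
    sample = (
        '<html><head><meta name="description" content="shop"></head>'
        '<body><div><p>Item</p><img src="x.png"><a href="/cart">Cart</a>'
        "</div></body></html>"
    )
    print(extract_features(sample))
```

In a full pipeline, vectors like this would be computed for each labelled page (Online Store, Internet Forum, News Article, Blog Article) and used to train the classifier; fetching the page from its URL is left out of this sketch.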
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: EEE Student Reports (FYP/IA/PA/PI)

Files in This Item:
Description: Restricted Access
Size: 1.41 MB
Format: Adobe PDF



Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.