Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/54272
Title: Clustering techniques for web documents
Authors: Pan, Tianchi
Keywords: DRNTU::Engineering
Issue Date: 2013
Abstract: Document clustering is the process of grouping documents into natural, homogeneous clusters so that documents within the same cluster are more similar to each other than to those in other clusters [1]. In the web environment, the task is more challenging, and effective clustering techniques are needed to support knowledge discovery. K-means is one of the most frequently used data clustering methods; however, it can fail to produce meaningful clusters when the input data is poorly structured. In this report, the distance metric learning method proposed by Eric P. Xing is therefore implemented with supplementary side information to improve K-means clustering performance. The resulting algorithm is studied in detail and validated on several datasets. Its performance is evaluated with three quantitative measures (NMI, purity and the Rand index) computed in Java, and with cluster visualization in MATLAB. The results show that the new clustering algorithm improves on the original K-means and might be used in future data clustering applications.
URI: http://hdl.handle.net/10356/54272
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: EEE Student Reports (FYP/IA/PA/PI)
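
Note on the evaluation measures: the abstract names purity, the Rand index and NMI without defining them. The sketch below illustrates their usual textbook definitions; it is not the report's code. It is written in Python rather than the Java used in the report, the function names are hypothetical, and normalizing NMI by the arithmetic mean of the two entropies is an assumption (other normalizations exist, and the report may use a different one).

```python
import math
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority true class of their cluster."""
    per_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        per_cluster.setdefault(p, Counter())[t] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of document pairs on which the clustering and the ground truth agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies (assumed variant)."""
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    h_t = -sum((c / n) * math.log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * math.log(c / n) for c in count_p.values())
    denom = (h_t + h_p) / 2
    # Both labelings trivial (one class, one cluster): treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0


# Toy example (made-up labels, not data from the report).
truth = ["a", "a", "a", "b", "b", "b"]
found = [0, 0, 1, 1, 1, 1]
scores = (purity(truth, found), rand_index(truth, found), nmi(truth, found))
```

On this toy input, five of the six documents sit in the majority class of their cluster, so purity is 5/6, and the two labelings agree on 10 of the 15 document pairs, so the Rand index is 2/3.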

Files in This Item:
File: eA3050121.pdf (Restricted Access)
Description: Main article
Size: 1.18 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.