Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/48647
Title: Enhanced subspace clustering
Authors: Sim, Kelvin Sian Hui.
Keywords: DRNTU::Engineering::Computer science and engineering::Computing methodologies
Issue Date: 2012
Abstract: Subspace clustering overcomes the curse of dimensionality that traditional clustering suffered, by finding groups of objects that are homogeneous in subspaces of the data, instead of the full space. Research on basic subspace clustering over the past decade primary focuses on finding groups of objects that are closed together in subspaces of 2D data. Recently, the proliferation of non-traditional data and the need for higher quality clustering results have shifted the research paradigm to enhanced subspace clustering, which focuses on problems that cannot be handled or solved effectively through basic subspace clustering. The problems of enhanced subspace clustering can be categorized into two main groups, handling non-traditional data and improving clustering results. We give a survey on the enhanced subspace clustering problems, desired properties that these problems sought in their solutions, and the existing solutions. We study three main problems of enhanced subspace clustering on 2D and 3D datasets: mining subspace clusters in noisy data, mining significant subspace clusters and mining semi-supervised subspace clusters. For mining subspace clusters in noisy data, we found several problems of existing approaches, such as mining incomplete and unstable results, lacking the ability to handle 3D data, and mining clusters that are non-maximal and that contain skewed noise. We propose subspace clusters that are maximal and do not contain skewed noise. We also develop algorithms which exploit the anti-monotone property of the clusters to efficiently mine the complete and stable set of results. We show the effectiveness of our solution in mining biologically significant protein clusters in protein-protein interaction data, which is notoriously noisy in nature. For mining significant subspace clusters, we formulate an information theory concept known as correlation information, to measure the significance of the subspace clusters. We propose mining subspace clusters with high correlation information, and we develop an algorithm which uses the concept of rarity to mine significant 3D subspace clusters in a parameter-insensitive way. We show the effectiveness of our solution in finding significant (1) groups of proteins in protein-protein interaction data, (2) clusters of words and documents in word-document data and (3) in classifying an insurance data, where significant clusters are used as rules of the classifier. For mining semi-supervised subspace clusters, we propose actionable subspace clusters, which are semi-supervised subspace clusters that allow incorporation of user's knowledge, and can suggest beneficial actions to the users. We develop algorithms that use augmented Lagrangian multiplier method coupled with frequent itemset mining algorithm to efficiently mine the actionable clusters in a parameter-insensitive way. We show the effectiveness of our solution in finding actionable groups of residues in protein structural data, which are potential binding sites for drug molecules. Lastly, we present a financial data mining application on value investing, and show that our proposed algorithms outperform a famous value investment strategy in 70% of the experiments.
URI: http://hdl.handle.net/10356/48647
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections:SCSE Theses

Files in This Item:
File Description SizeFormat 
Tsceg0502547b.pdf
  Restricted Access
2.43 MBAdobe PDFView/Open

Page view(s) 50

482
checked on Oct 20, 2020

Download(s) 50

15
checked on Oct 20, 2020

Google ScholarTM

Check

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.