Clustering and heterogeneous information fusion for social media theme discovery and associative mining
Date of Issue: 2014
School of Computer Engineering
The emergence of social networking websites has created numerous interactive platforms on which users upload, comment on, and share multimedia content within their social circles. This has produced a massive number of web multimedia documents, together with rich meta-information such as category information, user tags and descriptions, and user comments. Such interconnected but heterogeneous social media data provide opportunities for understanding traditional multimedia data, such as images and text documents. More importantly, the different types of activities and interactions of social users can be used to understand and analyze user behavior and to discover social trends in social networks. Clustering is an important approach to the analysis and mining of social media data. However, unlike traditional multimedia data, social media data are typically massive, diverse, heterogeneous and noisy. These characteristics raise new challenges for existing clustering techniques, including scalability to big data, the ability to automatically determine the number of clusters in a data set, strategies for effectively integrating data from heterogeneous sources, and robustness to noisy features. Moreover, because different users may prefer to categorize social media data differently, incorporating user preferences into the clustering framework to produce personalized clusters is also a challenge. To address these issues, in this thesis we investigate and develop novel clustering algorithms for the fast and robust clustering of large-scale social media data by integrating their multiple, different types of features and user preferences, and we explore their applications to associative social media mining tasks. Towards this goal, we have completed four key tasks.
First, we developed a two-step semi-supervised hierarchical clustering algorithm, termed Personalized Hierarchical Theme-based Clustering (PHTC), for personalized web image organization by exploiting the surrounding text of web images. Our experiments have shown that PHTC can identify high-quality clusters of web images under user supervision using the proposed semi-supervised clustering algorithm, called Probabilistic Fusion Adaptive Resonance Theory (PF-ART). In addition, it can organize the clusters into a systematic hierarchy with higher quality and lower time cost than several existing hierarchical clustering algorithms. Secondly, we proposed a semi-supervised heterogeneous data co-clustering algorithm, termed Generalized Heterogeneous Fusion Adaptive Resonance Theory (GHF-ART), for multimedia data co-clustering by integrating different types of features from inter-related but heterogeneous data sources and user preferences. Compared with existing approaches, GHF-ART has the advantages of strong noise immunity, adaptive feature weighting, low computational cost, and incremental clustering for handling dynamic social media data. Thirdly, we investigated the feasibility of applying GHF-ART to clustering social network data for discovering user communities in heterogeneous social networks, and demonstrated its capability for analyzing the correlation among different social links and mining the potential themes of user communities. Lastly, we studied the geometrical dynamics of Fuzzy ART and proposed three methods to adapt the vigilance parameter of Fuzzy ART. This leads to clustering algorithms that are insensitive to the input parameters when dealing with large and complex social media data. Our experiments have demonstrated the effectiveness of the proposed methods. Furthermore, the geometrical study of Fuzzy ART may also benefit further research.
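To make the role of the vigilance parameter concrete, the following is a minimal sketch of the standard Fuzzy ART procedure (complement coding, choice function, vigilance test, weight update). It is not the thesis's implementation and does not include the three proposed vigilance adaptation methods; the function names, default parameter values, and toy data are assumptions chosen only for illustration.

```python
def complement_code(x):
    """Map a feature vector in [0, 1]^d to [x, 1 - x] (complement coding)."""
    return list(x) + [1.0 - v for v in x]


def fuzzy_and(a, b):
    """Element-wise minimum, the fuzzy AND used by Fuzzy ART."""
    return [min(p, q) for p, q in zip(a, b)]


def l1(a):
    """L1 norm of a non-negative vector."""
    return sum(a)


def fuzzy_art(samples, rho=0.75, alpha=0.01, beta=1.0):
    """Cluster samples incrementally; rho is the vigilance parameter.

    Returns (labels, weights), where weights[j] is the template of cluster j.
    """
    weights, labels = [], []
    for x in samples:
        i_vec = complement_code(x)

        # Rank existing clusters by the choice function, best first.
        def choice(j):
            return l1(fuzzy_and(i_vec, weights[j])) / (alpha + l1(weights[j]))

        winner = None
        for j in sorted(range(len(weights)), key=choice, reverse=True):
            # Vigilance test: accept cluster j only if the match is high enough.
            match = l1(fuzzy_and(i_vec, weights[j])) / l1(i_vec)
            if match >= rho:
                winner = j
                break

        if winner is None:
            # No cluster passed the vigilance test: create a new one.
            weights.append(i_vec)
            winner = len(weights) - 1
        else:
            # Move the winning template towards the fuzzy AND of input and template.
            w = weights[winner]
            weights[winner] = [
                beta * m + (1.0 - beta) * wv
                for m, wv in zip(fuzzy_and(i_vec, w), w)
            ]
        labels.append(winner)
    return labels, weights


# Hypothetical toy data: two tight groups in the unit square.
toy = [[0.10, 0.10], [0.12, 0.11], [0.90, 0.90], [0.88, 0.92]]
labels, weights = fuzzy_art(toy, rho=0.75)
```

In this sketch a higher `rho` makes the vigilance test stricter and therefore produces more, smaller clusters, while a lower `rho` merges inputs into fewer clusters. This sensitivity is the reason adapting the vigilance parameter matters.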
While our completed studies have provided the base technologies for social media mining, future work building on this thesis may focus on the following aspects: 1) Modeling of short and noisy text; 2) Automated selection of the vigilance parameter in Fuzzy ART; 3) Improvement of the clustering mechanism of ART; 4) Extensions to multimedia data indexing, annotation and retrieval; 5) Exploiting the temporal factor for multimedia data storage and mining; and 6) Associative applications to social media mining tasks.
DRNTU::Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence