Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/1814
Title: Using domain knowledge to improve the quality of query clusters.
Authors: Tan, Swee Peng.
Keywords: DRNTU::Library and information science::Libraries::Information retrieval and analysis
Issue Date: 2007
Abstract: Our study hypothesizes that incorporating linguistic knowledge and domain knowledge into query clustering can improve, to some extent, the quality of query clusters. In our research, we used six months' worth of query logs from the Nanyang Technological University digital library. Using the WordNet lexical database, each query term in the query log was replaced with its corresponding synonym sets (synsets). These synsets serve as the features, in contrast to the content-based approach, where features are constructed from the query terms themselves. The synsets were weighted to reflect their importance in a query, and similarities between pairs of queries were computed using the cosine similarity measure. Our clustering algorithm placed two queries in the same cluster whenever the similarity between them exceeded a certain threshold. In this way, clusters were created at four different thresholds to facilitate comparison between them. The quality of the clusters was evaluated using five performance measures (average cluster size, coverage, precision, recall and the F-measure) against the judgments of two human evaluators on a sample of clusters. A comparison of the current study with the previous study conducted by Chandrani (2004) shows that coverage, precision, recall and F-measure were lower in the current study at all four thresholds. We identified two key reasons for the lower values: the additional preprocessing reduced the size of the query log, and the clusters formed were mainly on engineering-related subjects. The evaluation was then extended by incorporating domain knowledge on the part of the evaluators. Three performance measures (average cluster size, coverage and precision) were computed, and the results were compared with those of the current study and of the previous study conducted by Chandrani (2004).
Overall, precision improved, which we attribute to the evaluators' domain knowledge. We propose that further preprocessing, together with ways of extracting elements of domain knowledge to feed into the clustering process, could significantly improve precision; this is left as future work.
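The pipeline described in the abstract (synset features, weighting, cosine similarity, threshold-based grouping) might be sketched roughly as follows. This is an illustrative sketch only, not the thesis implementation. The `TOY_SYNSETS` map is a hypothetical stand-in for a WordNet lookup. Raw counts are assumed as weights, because the abstract does not state the weighting scheme. Merging queries transitively with union-find is one possible reading of "placed two queries in the same cluster".

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: term -> synset label.
TOY_SYNSETS = {
    "car": "syn:automobile",
    "automobile": "syn:automobile",
    "design": "syn:design",
    "plan": "syn:design",
    "network": "syn:network",
}

def synset_features(query):
    """Map each query term to its synset label (unknown terms map to
    themselves) and weight each feature by its raw count (assumed weighting)."""
    return Counter(TOY_SYNSETS.get(term, term) for term in query.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse feature-weight vectors."""
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_queries(queries, threshold):
    """Group queries whose pairwise similarity exceeds the threshold.
    Assumption: membership is transitive (connected components)."""
    vectors = [synset_features(q) for q in queries]
    parent = list(range(len(queries)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, q in enumerate(queries):
        groups.setdefault(find(i), []).append(q)
    return list(groups.values())

example_clusters = cluster_queries(
    ["car design", "automobile plan", "network security"], threshold=0.5
)
```

In this toy example, "car design" and "automobile plan" share no literal terms but map to the same synset labels, so they fall into one cluster, while "network security" stays on its own. Raising or lowering `threshold` corresponds to the comparison across thresholds mentioned in the abstract.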
URI: http://hdl.handle.net/10356/1814
Schools: Wee Kim Wee School of Communication and Information 
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: WKWSCI Theses

Files in This Item:
File: WKWSCI_THESES_450.pdf | Description: Restricted Access | Size: 545.48 kB | Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.