Title: Bag-of-Concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base
Authors: Li, Pengfei
Mao, Kezhi
Xu, Yuecong
Li, Qi
Zhang, Jiaheng
Keywords: Engineering::Computer science and engineering
Issue Date: 2020
Source: Li, P., Mao, K., Xu, Y., Li, Q., & Zhang, J. (2020). Bag-of-Concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base. Knowledge-Based Systems, 193, 105436. doi:10.1016/j.knosys.2019.105436
Journal: Knowledge-Based Systems
Abstract: Text representation, a crucial step in text mining and natural language processing, concerns transforming unstructured textual data into structured numerical vectors to support various machine learning and data mining algorithms. For document classification, one classical and commonly adopted text representation method is the Bag-of-Words (BoW) model. BoW represents a document as a fixed-length vector of terms, where each term dimension is a numerical value such as term frequency or tf-idf weight. However, BoW looks only at the surface form of words. It ignores the semantic, conceptual and contextual information of texts, and also suffers from high dimensionality and sparsity. To address these issues, we propose a novel document representation scheme called Bag-of-Concepts (BoC), which automatically acquires useful conceptual knowledge from an external knowledge base, then conceptualizes words and phrases in the document into higher-level semantics (i.e., concepts) in a probabilistic manner, and eventually represents a document as a distributed vector in the learned concept space. By utilizing background knowledge from the knowledge base, the BoC representation is able to provide more semantic and conceptual information about texts, as well as better interpretability for human understanding. We also propose the Bag-of-Concept-Clusters (BoCCl) model, which clusters semantically similar concepts together and performs entity sense disambiguation to further improve the BoC representation. In addition, we combine the BoCCl and BoW representations using an attention mechanism to effectively utilize both concept-level and word-level information and achieve optimal performance for document classification.
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2019.105436
Rights: © 2020 Elsevier. All rights reserved. This paper was published in Knowledge-Based Systems and is made available with permission of Elsevier.
Fulltext Permission: embargo_20221231
Fulltext Availability: With Fulltext
Appears in Collections: EEE Journal Articles

Files in This Item:
Description: final manuscript
Size: 1.86 MB
Format: Adobe PDF
Under embargo until Dec 31, 2022

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.