Please use this identifier to cite or link to this item:
Title: Domain-agnostic document and question classification using natural language processing techniques
Authors: Supraja, S.
Keywords: Engineering::Electrical and electronic engineering
Issue Date: 2022
Publisher: Nanyang Technological University
Source: Supraja, S. (2022). Domain-agnostic document and question classification using natural language processing techniques. Doctoral thesis, Nanyang Technological University, Singapore.
Abstract: This thesis addresses the classification of documents and questions into domain-agnostic class labels. Here, domain refers to the subject matter with which the class labels are associated. Domain-specific document or question classification is commonly applied in article categorization or factoid question answering, with class labels defined by subject matter. For instance, for digital signal processing (DSP) questions, domain-specific class labels such as Fourier Transform or z-transform reflect the explicit meaning of the questions. In contrast, applications of domain-agnostic document classification include classifying job descriptions into generic skillsets, scientific statements into section types, and sentences into argumentative zone functions. Because questions possess different characteristics, domain-agnostic question classification is applied in information query or dialogue interactions, in which the class labels may comprise question types or reasoning capabilities. To enhance the effectiveness of deliberate practice, questions are classified by cognitive complexity so that instructors can determine learners’ proficiencies. Quite often, in scenarios where the size of the question bank is limited, statistical approaches are adopted for feature extraction. Since domain-agnostic classification takes the implicit substance of a text into account (e.g., the learning outcome of a DSP question irrespective of its content), it relies on a suitable feature extraction process. This thesis explores the use of topic modeling techniques as feature extractors for questions because of their ability to offer linguistic insights into language patterns by grouping associated words into topics and then computing the probabilities of topics occurring in each document.
Considering the limitations of employing baseline topic modeling algorithms for automatic question classification (AQC), an algorithm that observes the effect of pre-processing procedures and word co-occurrence redundancy is proposed. However, this method is dataset-specific and requires hand-curated word tagging. To address these shortcomings, a new holistic, generalizable, regularized phrase-based topic modeling technique is proposed. This technique is driven by the fact that phrases have been shown to be more effective than words at representing questions. Further elements such as nested regular expressions and scaling parameters are employed to facilitate efficient mapping of questions to class labels. For documents, graph networks are adopted as the baseline algorithm. This thesis shows that graph networks are suitable because establishing the relationships between documents is important for classifying them into domain-agnostic categories. In addition, graphs encompass a global perspective, whereas conventional deep learning techniques are both localized and sequential. In the proposed quad-faceted feature-based graph network, this thesis shows that the addition of a new topical layer is vital for observing the impact of topic modeling on generating a meaningful set of features. It also highlights that regular expressions of a domain-agnostic nature are important for co-occurrence statistics, while the meaning of a document encapsulated via phrase nodes is crucial for semantic relationships.
DOI: 10.32657/10356/157159
Schools: School of Electrical and Electronic Engineering 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: EEE Theses

Files in This Item:
File: Thesis_S Supraja_G1601128E_04052022.pdf
Description: Ph.D. (EEE) Thesis - S Supraja
Size: 6.05 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.