Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/62824
Title: Lexical knowledge-based machine learning method for sentiment analysis
Authors: Heng, Lai Xiang
Keywords: DRNTU::Engineering::Computer science and engineering::Information systems
Issue Date: 2015
Abstract: Before any sentiment analysis or classification can be done, one needs labelled reviews (each marked with a positive or negative sentiment) for further data mining or natural language processing. Labelling reviews is done manually and is usually time-consuming and demanding. In this paper, we propose a new learning algorithm that combines supervised learning with pre-compiled opinion lexicons. With this algorithm, the manpower and time needed are greatly reduced, as it does not require manual labelling of reviews. For this project, customers’ reviews of restaurants are drawn from the Yelp dataset. The new algorithm has five steps: 1) Build two pseudo documents, one positive and one negative. 2) Compute the pairwise document similarity between each review document and the positive and negative documents using either Cosine Similarity or Euclidean Distance. 3) Label each review as positive or negative based on the similarity results. 4) Rank the reviews. 5) Select the top 2,000 reviews, 1,000 each from the positive and negative labelled documents, for building the sentiment classification model. In this experiment, we looked into both Naïve Bayes and Support Vector Machine (SVM) classifiers. Three feature extraction methods are used to train the classifiers: the bag of words model, the bag of words model with stopwords removed, and significant bigrams. For the Naïve Bayes classifier, significant bigrams performed best, achieving 67% accuracy, whereas the bag of words model performed worst. The SVM classifier, on the other hand, performs well with both the bag of words model and the bag of words model with stopwords removed, achieving an accuracy of about 99%. However, this may indicate overfitting due to the large number of sparse features.
Nevertheless, this experiment shows that automated labelling of reviews is possible and brings that goal one step closer.
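
The five steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the two word lists are tiny hypothetical stand-ins for the pre-compiled opinion lexicons, the tokenizer is deliberately naive, only the Cosine Similarity variant is shown, and the ranking criterion (the gap between the two similarity scores) is an assumption, since the abstract does not say how reviews are ranked. The names `pseudo_label`, `POSITIVE_WORDS` and `NEGATIVE_WORDS` are made up for this example.

```python
import math
import re
from collections import Counter

# Hypothetical toy lexicons; the report relies on pre-compiled opinion lexicons.
POSITIVE_WORDS = ["good", "great", "delicious", "friendly", "excellent"]
NEGATIVE_WORDS = ["bad", "terrible", "bland", "rude", "slow"]


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pseudo_label(reviews, top_k):
    """Label reviews without manual annotation and keep the top_k per class."""
    # Step 1: build the pseudo positive and negative documents.
    pos_doc = Counter(POSITIVE_WORDS)
    neg_doc = Counter(NEGATIVE_WORDS)

    scored = []
    for review in reviews:
        vec = Counter(tokenize(review))
        # Step 2: similarity of the review to each pseudo document.
        pos_sim = cosine_similarity(vec, pos_doc)
        neg_sim = cosine_similarity(vec, neg_doc)
        # Assumption: reviews with no evidence either way are skipped.
        if pos_sim == neg_sim:
            continue
        # Step 3: label by the more similar pseudo document.
        label = "pos" if pos_sim > neg_sim else "neg"
        scored.append((abs(pos_sim - neg_sim), label, review))

    # Step 4: rank reviews by the similarity gap (assumed criterion).
    scored.sort(key=lambda item: item[0], reverse=True)

    # Step 5: keep the top_k highest-ranked reviews of each class.
    selected = {"pos": [], "neg": []}
    for _gap, label, review in scored:
        if len(selected[label]) < top_k:
            selected[label].append(review)
    return selected
```

Calling `pseudo_label(reviews, top_k=1000)` on a list of review strings would give the 1,000 + 1,000 pseudo-labelled reviews described in step 5, which could then be passed to a Naïve Bayes or SVM classifier as training data.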
URI: http://hdl.handle.net/10356/62824
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: SCE14-0082_Final_Year_Project_Report.pdf
Description: Restricted Access
Size: 1.7 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.