Title: Sentence classification of online drug reviews using machine learning techniques
Authors: Sukumar Warrier, Vinay
Keywords: DRNTU::Library and information science
DRNTU::Engineering::Computer science and engineering::Information systems
DRNTU::Humanities::Linguistics::Sociolinguistics::Computational linguistics
Issue Date: 2016
Abstract: Background: Adverse drug effects form a vital part of pharmacovigilance. With the advent of Web 2.0, online drug review sites, with their voluntary reporting and unrestricted data access, offer a competitive alternative to traditional health records and medical case reports for mining drug side effects.
Objectives: This study aims to develop text classification models using machine learning to categorize sentences in online drug reviews into three categories: sentences with side effect information, sentences with a negative attitude towards the drug, and sentences indicating a positive effect/drug efficacy. The secondary objective is to bootstrap a training corpus from unlabelled review sentences and train classifier models with it.
Methods: Three undergraduate coders annotated 1000 randomly selected reviews from a WebMD-based online drug review corpus at the sentence level. Features were developed from the sentences and reviews based on length, position, sentiment, matches with a side-effect dictionary, and cues indicating a side effect. Linguistic features such as unigrams, bigrams, and trigrams were also extracted. 70% of the labelled corpus formed the training set, and multiple classifier models based on logistic regression and support vector machines were tested to predict the three target categories. Bootstrapping started from a rule-based seed labelling system, which was then used to expand labels across the unlabelled data. Classifiers trained on the bootstrapped data were used to classify the labelled test corpus.
Results: Logistic regression produced the best model for the sentence category ‘Side Effect’, with an F-measure of 0.63. Side-effect dictionary terms and sentiment values were among the top significant predictors. Support vector machine (SVM) classifiers produced the best models for the ‘Negative Sentiment/Side Effect’ (F-measure: 0.60) and ‘Effective, Positive’ (F-measure: 0.64) categories. A random forest classifier produced the best model for predicting ‘Side Effect’ using bootstrapped labels (F-measure: 0.45). Apart from dictionary-based features, sentence position, length, and sentiment score played important roles in all the models.
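The abstract names three building blocks that are easy to illustrate in isolation: n-gram extraction, a rule-based seed label driven by a side-effect dictionary, and the F-measure used to report results. The following is a minimal sketch under stated assumptions, not the thesis's implementation: the dictionary terms, the tokenizer, and all function names are hypothetical, and the seed rule shown (any dictionary match) is only one plausible reading of "rule-based seed labelling".

```python
from __future__ import annotations

# Hypothetical mini side-effect dictionary; the dictionary used in the thesis is not given in this record.
SIDE_EFFECT_TERMS = {"nausea", "headache", "dizziness", "rash"}


def tokenize(sentence: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?;:()\"'") for tok in sentence.lower().split())
    return [tok for tok in tokens if tok]


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return contiguous n-grams (n=1 unigrams, n=2 bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def seed_label(sentence: str) -> bool:
    """Assumed seed rule: label as 'Side Effect' if any dictionary term occurs."""
    return any(tok in SIDE_EFFECT_TERMS for tok in tokenize(sentence))


def f_measure(gold: list[bool], predicted: list[bool]) -> float:
    """F1 score: harmonic mean of precision and recall for the positive class."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, `seed_label("This drug gave me a headache.")` returns `True` under the hypothetical dictionary above, and comparing such seed labels against coder annotations with `f_measure` gives the kind of score reported in the Results.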
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: WKWSCI Theses

Files in This Item:
File Description: Restricted Access
Size: 1.24 MB
Format: Adobe PDF

Updated on May 12, 2021
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.