Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/48540
Title: Hadoop on data analytics
Authors: Gee, Denny Jee King.
Keywords: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Issue Date: 2012
Abstract: Twitter is a micro-blogging application with strong capabilities for sharing and conveying ideas effectively through social connectivity. The data sets generated by Twitter can easily exceed millions of tweets per day, and mining information from such a massive amount of data can be computationally infeasible. Hence, we adopted the MapReduce framework provided by Apache Hadoop. Our model first pre-processed the tweets by tokenizing them into individual words and filtering out stop words and punctuation. Next, we grouped the remaining words into their respective intervals, together with their tweet frequency distributions, and constructed time series signals based on their Document Frequency-Inverse Document Frequency (DF-IDF) vectors by chaining a sequence of MapReduce jobs on the Hadoop framework. After the data transformation, we computed the autocorrelation of each word signal and filtered out trivial words of lesser importance using a threshold of 0.1. We further calculated the entropy of the word signals to determine their level of randomness, so that words with low IDF values and narrow entropy (H < 0.25) were also removed, in order to extract only the words that contain burst features in their time series. The outstanding words were then sorted by their autocorrelation coefficients using Hadoop's partition sorting mechanism, and we introduced a percentile selection on the number of words for event detection. The detected events were then mapped onto a cross-correlation matrix based on bi-word combinations. The words were then represented as an adjacency graph, which we partitioned by modularity to cluster words of similar relevance and features and reconstruct events. The events were then evaluated based on their relevance to corresponding real-life events.
The computation on our Hadoop cluster gave remarkable results in terms of efficiency and data size compression. The cluster achieved a 75% reduction in computation time and, at the same time, the MapReduce architecture of Hadoop reduced the data size by close to 99% by indexing terms.
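The word-signal filtering described in the abstract can be illustrated with a minimal single-machine sketch. It is not the report's implementation: the DF-IDF formula, the lag-1 autocorrelation, the entropy normalization, and every function and parameter name below (including the `idf_min` cutoff for "low IDF") are assumptions made for illustration. Only the thresholds 0.1 and 0.25 come from the abstract.

```python
import math

def df_idf_signal(word_counts, total_counts):
    """Build a DF-IDF time series for one word (assumed formula).

    word_counts[t]: tweets in interval t containing the word.
    total_counts[t]: all tweets in interval t.
    Returns (signal, idf).
    """
    if sum(word_counts) == 0:
        raise ValueError("word never occurs")
    idf = math.log(sum(total_counts) / sum(word_counts))
    signal = [idf * (nw / n if n else 0.0)
              for nw, n in zip(word_counts, total_counts)]
    return signal, idf

def autocorrelation(signal, lag=1):
    """Sample autocorrelation coefficient at the given lag."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal)
    if var == 0:
        return 0.0
    cov = sum((signal[t] - mean) * (signal[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def normalized_entropy(signal):
    """Shannon entropy of the signal's mass over intervals, scaled to [0, 1]."""
    total = sum(signal)
    if total <= 0 or len(signal) < 2:
        return 0.0
    probs = [x / total for x in signal if x > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(signal))

def keep_word(signal, idf, acf_min=0.1, h_min=0.25, idf_min=1.0):
    """Apply the two filters; idf_min is a hypothetical cutoff for 'low IDF'."""
    if autocorrelation(signal) < acf_min:
        return False  # trivial word: weak autocorrelation
    if idf < idf_min and normalized_entropy(signal) < h_min:
        return False  # low IDF combined with narrow entropy
    return True

# Usage: a word that appears in a sustained burst is kept,
# while one that flickers on and off is dropped.
burst = [0, 0, 0, 1, 1, 1, 0, 0, 0]
flicker = [1, 0, 1, 0, 1, 0]
print(keep_word(burst, idf=3.0), keep_word(flicker, idf=3.0))
```

In a Hadoop setting, each of these steps would typically run as its own job in the chain, with the word as the key, but that mapping is likewise an assumption rather than a detail given in the abstract.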
URI: http://hdl.handle.net/10356/48540
Schools: School of Computer Engineering 
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: sce11-0141.pdf (Restricted Access)
Size: 2.42 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.