Title: Information extraction from bibliography data
Authors: Leong, Kai Ling
Keywords: Engineering::Computer science and engineering
Issue Date: 2022
Publisher: Nanyang Technological University
Source: Leong, K. L. (2022). Information extraction from bibliography data. Final Year Project (FYP), Nanyang Technological University, Singapore.
Abstract: The Computer Science community and its research have grown exponentially over the past decade. Analysing trends in research topics is a common way to keep track of the increasing amount of research and to provide insights to academia, businesses, government, and other stakeholders. However, the trends observed in the Computer Science community may not be reflected among the general audience. This project analyses the trends observed in the field of Computer Science and among the general audience, and examines whether the two follow the same trends. The project followed the OSEMN framework. The dblp dataset, an XML file, was downloaded from the dblp website, and a StAX parser was implemented to load it into a MySQL database. The data were cleaned and pre-processed, then used to build the LDA model. The final parameters of the LDA model were chosen, the topics were finalised based on their keywords, and topic trends were extracted from these topics. Additional data were queried from Google Ngram Viewer as a second set of topic trends representing the general audience. Trend analysis was performed using a line of best fit, correlations, and the Mann-Kendall trend test. The line of best fit was easy to implement and provided a visual representation of the trends, but was statistically unreliable for drawing conclusions about topic trends. Although correlations are not intended for trend detection, they were a more statistically reliable approach to determining trends. The Mann-Kendall trend test was the best approach, as it is designed to detect trends and numerous studies over the years have proposed modified versions to accommodate different limitations and types of data. About half of the topics showed similar trends in both communities, which is not enough evidence to conclude confidently that the two communities share the same trends.
As such, this project concludes that those in the field of Computer Science and the general audience do not share the same trends. Given this finding, academia, organisations, and other stakeholders should treat the field of Computer Science and the general audience as separate communities that may have different growth directions in the observed research topics.
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Student Reports (FYP/IA/PA/PI)

Files in This Item:
Description: Restricted Access
Size: 1.48 MB
Format: Adobe PDF





Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.