Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/46343
Title: Implementation of web product search engine : parallel incremental web crawler
Authors: Lwi, Tiong Chai.
Keywords: DRNTU::Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Issue Date: 2011
Abstract: One of the main objectives in designing a Parallel Incremental Web Crawler is to address the problem of building a large-scale, web-based Content-Based Image Retrieval (CBIR) system. Our CBIR system has indexed more than 1 million images crawled from various Business-to-Consumer (B2C) websites to date. Internet traffic is becoming more complex, and analyzing how websites are interlinked and how similar their content is matters for Web Mining. Because the web is growing and dynamic, traversing and handling all the URLs in web documents poses unprecedented scaling challenges, so it has become imperative to parallelize the crawling process to extract useful data from the web. In this report, we propose a novel architecture for a parallel crawler with an optimization model that is scalable and resilient against system crashes, maximizing the download rate while minimizing parallelization overhead through API-based and domain-specific crawling. We also discuss how our crawling module is realized to make the crawling task more effective and scalable during data collection, without recursively crawling the same honey pot. We further discuss the storage of extracted data using certain data management techniques, as well as image processing techniques such as spatial anti-aliasing and enhancement that the crawler applies when an image is processed and stored. Finally, several experiments were conducted to evaluate the quality of the processed data and the parallel performance of the algorithms in the web crawler. Several benchmarking tests were also conducted to evaluate CPU resource utilization and the freshness of the eVISE operational database.
URI: http://hdl.handle.net/10356/46343
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Student Reports (FYP/IA/PA/PI)

Files in This Item:
File: SCE10-0400.pdf
Description: Restricted Access
Size: 6.11 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.