Title: Selecting training samples from large and noisy corpora for efficient text classification
Authors: Wong, Daji
Keywords: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Issue Date: 2009
Abstract: This thesis presents an algorithm that selects sample documents for training text classifiers. The pool of available documents is often very large and noisy. For both efficiency and accuracy, one needs well-chosen samples rather than blind samples such as those produced by simple random sampling. The proposed algorithm is far superior to simple random sampling, both at small sampling ratios and in the presence of noise. It is based on a simple principle: each term should have approximately the same document frequency in the set of training sample documents as in the whole set (not including the test set).
Description: 59 p.
Rights: Nanyang Technological University
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: WKWSCI Theses
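
The principle stated in the abstract (term document frequencies in the training sample should track those in the whole training pool) can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose full text is restricted. It is a hypothetical best-of-several-random-draws selector, and it assumes documents are lists of tokens, that "document frequency" means the fraction of documents containing a term, and that closeness is measured by a sum of absolute differences. The names `relative_df`, `df_distance`, and `select_sample` are invented for illustration.

```python
import random
from collections import Counter


def relative_df(docs):
    """Map each term to the fraction of documents in `docs` that contain it."""
    counts = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    return {term: count / n for term, count in counts.items()}


def df_distance(sample, target_df):
    """Sum of absolute differences between the sample's relative DF and the target's."""
    sample_df = relative_df(sample)
    terms = set(target_df) | set(sample_df)
    return sum(abs(sample_df.get(t, 0.0) - target_df.get(t, 0.0)) for t in terms)


def select_sample(docs, k, trials=200, seed=0):
    """Assumed strategy: draw `trials` random samples of size k and keep the one
    whose relative DF is closest to that of the full training pool."""
    rng = random.Random(seed)
    target_df = relative_df(docs)
    best, best_dist = None, float("inf")
    for _ in range(trials):
        candidate = rng.sample(docs, k)
        dist = df_distance(candidate, target_df)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best, best_dist
```

Calling `select_sample(pool, k)` on a list of tokenized training documents returns the chosen sample together with its distance from the pool's document-frequency profile. Test documents should be excluded from `pool`, as the abstract specifies.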

Files in This Item:
File: Restricted Access
Size: 6.86 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.